Preface


This book is a general introduction to machine learning that can serve as a textbook 
for students and researchers in the field. It covers fundamental modern topics in 
machine learning while providing the theoretical basis and conceptual tools needed 
for the discussion and justification of algorithms. It also describes several key aspects 
of the application of these algorithms. 

We have aimed to present the most novel theoretical tools and concepts while 
giving concise proofs, even for relatively advanced results. In general, whenever 
possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial 
complex topics arising in machine learning and highlight several open research 
questions. Certain topics often merged with others or treated with insufficient
attention are discussed separately here and with more emphasis: for example,
separate chapters are reserved for multi-class classification, ranking, and regression.

Although we cover a very wide variety of important topics in machine learning, we 
have chosen to omit a few important ones, including graphical models and neural 
networks, both for the sake of brevity and because of the current lack of solid 
theoretical guarantees for some methods. 

The book is intended for students and researchers in machine learning, statistics 
and other related areas. It can be used as a textbook for both graduate and advanced 
undergraduate classes in machine learning or as a reference text for a research 
seminar. The first three chapters of the book lay the theoretical foundation for the 
subsequent material. Other chapters are mostly self-contained, with the exception 
of chapter 5 which introduces some concepts that are extensively used in later 
ones. Each chapter concludes with a series of exercises, with full solutions presented 
separately. 

The reader is assumed to be familiar with basic concepts in linear algebra,
probability, and analysis of algorithms. However, to further help the reader, we
present in the appendix a concise review of linear algebra and probability, and a
short introduction to convex optimization. We have also collected in the appendix a
number of useful tools for concentration bounds used in this book.

To our knowledge, there is no single textbook covering all of the material 
presented here. The need for a unified presentation has been pointed out to us
every year by our machine learning students. There are several good books for
various specialized areas, but these books do not include a discussion of other 
fundamental topics in a general manner. For example, books about kernel methods 
do not include a discussion of other fundamental topics such as boosting, ranking, 
reinforcement learning, learning automata or online learning. There also exist more 
general machine learning books, but the theoretical foundation of our book and our 
emphasis on proofs make our presentation quite distinct. 

Most of the material presented here originated in a machine learning
graduate course (Foundations of Machine Learning) taught by the first author
at the Courant Institute of Mathematical Sciences at New York University over
the last seven years. This book has considerably benefited from the comments
and suggestions of students in these classes, along with those of many friends,
colleagues and researchers to whom we are deeply indebted.

We are particularly grateful to Corinna Cortes and Yishay Mansour who have 
both made a number of key suggestions for the design and organization of the 
material presented with detailed comments that we have fully taken into account 
and that have greatly improved the presentation. We are also grateful to Yishay 
Mansour for using a preliminary version of the book for teaching and for reporting 
his feedback to us. 

We also thank the following colleagues and friends from academic and corporate
research laboratories for discussions, suggested improvements, and contributions of
many kinds: Cyril Allauzen, Stephen Boyd, Spencer Greenberg, Lisa Hellerstein, Sanjiv
Kumar, Ryan McDonald, Andres Munoz Medina, Tyler Neylon, Peter Norvig, Fernando
Pereira, Maria Pershina, Ashish Rastogi, Michael Riley, Umar Syed, Csaba
Szepesvari, Eugene Weinstein, and Jason Weston.

Finally, we thank the MIT Press publication team for their help and support in 
the development of this text. 


1 Introduction 


Machine learning can be broadly defined as computational methods using experience 
to improve performance or to make accurate predictions. Here, experience refers to 
the past information available to the learner, which typically takes the form of 
electronic data collected and made available for analysis. This data could be in the 
form of digitized human-labeled training sets, or other types of information obtained 
via interaction with the environment. In all cases, its quality and size are crucial to 
the success of the predictions made by the learner. 

Machine learning consists of designing efficient and accurate prediction algorithms.
As in other areas of computer science, some critical measures of the quality
of these algorithms are their time and space complexity. But, in machine learning,
we will need additionally a notion of sample complexity to evaluate the sample size
required for the algorithm to learn a family of concepts. More generally, theoretical
learning guarantees for an algorithm depend on the complexity of the concept
classes considered and the size of the training sample.

Since the success of a learning algorithm depends on the data used, machine 
learning is inherently related to data analysis and statistics. More generally, learning 
techniques are data-driven methods combining fundamental concepts in computer 
science with ideas from statistics, probability and optimization. 


1.1 Applications and problems 


Learning algorithms have been successfully deployed in a variety of applications, 
including 
= Text or document classification, e.g., spam detection;

= Natural language processing, e.g., morphological analysis, part-of-speech tagging,
statistical parsing, named-entity recognition;

= Speech recognition, speech synthesis, speaker verification;

= Optical character recognition (OCR);

= Computational biology applications, e.g., protein function or structured prediction;

= Computer vision tasks, e.g., image recognition, face detection;

= Fraud detection (credit card, telephone) and network intrusion;

= Games, e.g., chess, backgammon;

= Unassisted vehicle control (robots, navigation);

= Medical diagnosis;

= Recommendation systems, search engines, information extraction systems.


This list is by no means comprehensive, and learning algorithms are applied to new 
applications every day. Moreover, such applications correspond to a wide variety of 
learning problems. Some major classes of learning problems are: 


= Classification: Assign a category to each item. For example, document classification
may assign items to categories such as politics, business, sports, or weather,
while image classification may assign items to categories such as landscape, portrait,
or animal. The number of categories in such tasks is often relatively small,
but can be large in some difficult tasks and even unbounded as in OCR, text
classification, or speech recognition.

= Regression: Predict a real value for each item. Examples of regression include
prediction of stock values or variations of economic variables. In this problem, the
penalty for an incorrect prediction depends on the magnitude of the difference
between the true and predicted values, in contrast with the classification problem,
where there is typically no notion of closeness between various categories.

= Ranking: Order items according to some criterion. Web search, e.g., returning
web pages relevant to a search query, is the canonical ranking example. Many other
similar ranking problems arise in the context of the design of information extraction
or natural language processing systems.

= Clustering: Partition items into homogeneous regions. Clustering is often performed
to analyze very large data sets. For example, in the context of social network
analysis, clustering algorithms attempt to identify “communities” within large
groups of people.

= Dimensionality reduction or manifold learning: Transform an initial representation
of items into a lower-dimensional representation of these items while preserving
some properties of the initial representation. A common example involves
preprocessing digital images in computer vision tasks.


The main practical objectives of machine learning consist of generating accurate
predictions for unseen items and of designing efficient and robust algorithms to
produce these predictions, even for large-scale problems. To do so, a number of
algorithmic and theoretical questions arise. Some fundamental questions include:
Which concept families can actually be learned, and under what conditions? How
well can these concepts be learned computationally?

Figure 1.1 The zig-zag line on the left panel is consistent over the blue and red
training sample, but it is a complex separation surface that is not likely to generalize
well to unseen data. In contrast, the decision surface on the right panel is simpler
and might generalize better in spite of its misclassification of a few points of the
training sample.


1.2 Definitions and terminology 


We will use the canonical problem of spam detection as a running example to 
illustrate some basic definitions and to describe the use and evaluation of machine 
learning algorithms in practice. Spam detection is the problem of learning to 
automatically classify email messages as either SPAM or non-SPAM. 


= Examples: Items or instances of data used for learning or evaluation. In our spam 
problem, these examples correspond to the collection of email messages we will use 
for learning and testing. 


= Features: The set of attributes, often represented as a vector, associated to an 
example. In the case of email messages, some relevant features may include the 
length of the message, the name of the sender, various characteristics of the header, 
the presence of certain keywords in the body of the message, and so on. 


= Labels: Values or categories assigned to examples. In classification problems, 
examples are assigned specific categories, for instance, the SPAM and non-SPAM 
categories in our binary classification problem. In regression, items are assigned 
real-valued labels. 

= Training sample: Examples used to train a learning algorithm. In our spam 
problem, the training sample consists of a set of email examples along with their 
associated labels. The training sample varies for different learning scenarios, as 
described in section 1.4. 


= Validation sample: Examples used to tune the parameters of a learning algorithm
when working with labeled data. Learning algorithms typically have one or more
free parameters, and the validation sample is used to select appropriate values for 
these model parameters. 


= Test sample: Examples used to evaluate the performance of a learning algorithm. 
The test sample is separate from the training and validation data and is not made 
available in the learning stage. In the spam problem, the test sample consists of a 
collection of email examples for which the learning algorithm must predict labels 
based on features. These predictions are then compared with the labels of the test 
sample to measure the performance of the algorithm. 


= Loss function: A function that measures the difference, or loss, between a predicted
label and a true label. Denoting the set of all labels as $\mathcal{Y}$ and the set of
possible predictions as $\mathcal{Y}'$, a loss function $L$ is a mapping
$L\colon \mathcal{Y} \times \mathcal{Y}' \to \mathbb{R}_+$. In most
cases, $\mathcal{Y}' = \mathcal{Y}$ and the loss function is bounded, but these conditions do not always
hold. Common examples of loss functions include the zero-one (or misclassification)
loss defined over $\{-1, +1\} \times \{-1, +1\}$ by $L(y, y') = 1_{y' \neq y}$ and the squared loss
defined over $I \times I$ by $L(y, y') = (y' - y)^2$, where $I \subseteq \mathbb{R}$ is typically a bounded
interval.


= Hypothesis set: A set of functions mapping features (feature vectors) to the set of
labels $\mathcal{Y}$. In our example, these may be a set of functions mapping email features
to $\mathcal{Y} = \{\text{SPAM}, \text{non-SPAM}\}$. More generally, hypotheses may be functions mapping
features to a different set $\mathcal{Y}'$. They could be linear functions mapping email feature
vectors to real numbers interpreted as scores ($\mathcal{Y}' = \mathbb{R}$), with higher score values
more indicative of SPAM than lower ones.
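
As a minimal sketch of the two loss functions named in the list above, the following Python snippet implements the zero-one loss and the squared loss and averages a loss over a small sample. The function names and the toy labels are illustrative assumptions and do not come from the text.

```python
def zero_one_loss(y_true, y_pred):
    """Zero-one loss: 1 if the prediction differs from the true label, else 0."""
    return 0 if y_pred == y_true else 1


def squared_loss(y_true, y_pred):
    """Squared loss (y' - y)^2 for real-valued labels."""
    return (y_pred - y_true) ** 2


def average_loss(loss, labels, predictions):
    """Average of a loss over paired true labels and predictions."""
    return sum(loss(y, p) for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical labels in {-1, +1} and the predictions of some classifier.
labels = [+1, -1, -1, +1]
predictions = [+1, +1, -1, -1]
print(average_loss(zero_one_loss, labels, predictions))  # 0.5 (two of four wrong)
```

The zero-one loss ignores how far a prediction is from the truth, while the squared loss grows with the magnitude of the difference, which mirrors the contrast between classification and regression drawn in section 1.1.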


We now define the learning stages of our spam problem. We start with a given 
collection of labeled examples. We first randomly partition the data into a training 
sample, a validation sample, and a test sample. The size of each of these samples 
depends on a number of different considerations. For example, the amount of data 
reserved for validation depends on the number of free parameters of the algorithm. 
Also, when the labeled sample is relatively small, the amount of training data is 
often chosen to be larger than that of test data since the learning performance 
directly depends on the training sample. 

Next, we associate relevant features to the examples. This is a critical step in 
the design of machine learning solutions. Useful features can effectively guide the 
learning algorithm, while poor or uninformative ones can be misleading. Although 
it is critical, to a large extent, the choice of the features is left to the user. This 
choice reflects the user’s prior knowledge about the learning task which in practice 
can have a dramatic effect on the performance results. 

Now, we use the features selected to train our learning algorithm by fixing different 
values of its free parameters. For each value of these parameters, the algorithm
selects a different hypothesis out of the hypothesis set. We choose among them
the hypothesis resulting in the best performance on the validation sample. Finally, 
using that hypothesis, we predict the labels of the examples in the test sample. The 
performance of the algorithm is evaluated by using the loss function associated to 
the task, e.g., the zero-one loss in our spam detection task, to compare the predicted 
and true labels. 
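
The learning stages just described can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the learner is a one-dimensional k-nearest-neighbor rule chosen for brevity (it is not the method of the text), and the number of neighbors `k` plays the role of the free parameter.

```python
import random


def zero_one_error(h, sample):
    """Average zero-one loss of hypothesis h on a labeled sample."""
    return sum(h(x) != y for x, y in sample) / len(sample)


def train_knn(train, k):
    """Return the k-nearest-neighbor hypothesis for 1-D features (k odd)."""
    def h(x):
        neighbors = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
        votes = sum(y for _, y in neighbors)
        return 1 if votes > 0 else -1
    return h


rng = random.Random(0)
# Synthetic labeled data: label +1 when x > 0.5, with 10% of labels flipped.
data = []
for _ in range(300):
    x = rng.random()
    y = 1 if x > 0.5 else -1
    if rng.random() < 0.1:
        y = -y
    data.append((x, y))

# Random partition into training, validation, and test samples.
rng.shuffle(data)
train, validation, test = data[:200], data[200:250], data[250:]

# One hypothesis per value of the free parameter; keep the best on validation.
candidates = {k: train_knn(train, k) for k in (1, 5, 15)}
best_k = min(candidates, key=lambda k: zero_one_error(candidates[k], validation))

# The test sample is used only once, to report the final performance.
print("selected k:", best_k)
print("test error:", zero_one_error(candidates[best_k], test))
```

Note that the test sample never influences the choice of `k`; it is touched only after the hypothesis has been selected.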

Thus, the performance of an algorithm is of course evaluated based on its test error
and not its error on the training sample. A learning algorithm may be consistent,
that is, it may commit no error on the examples of the training data, and yet
have a poor performance on the test data. This occurs for consistent learners
defined by very complex decision surfaces, as illustrated in figure 1.1, which tend
to memorize a relatively small training sample instead of seeking to generalize well.
This highlights the key distinction between memorization and generalization, the
latter being the fundamental property sought for an accurate learning algorithm.
Theoretical guarantees for consistent learners will be discussed in great detail in
chapter 2.


1.3 Cross-validation


In practice, the amount of labeled data available is often too small to set aside 
a validation sample since that would leave an insufficient amount of training data. 
Instead, a widely adopted method known as n-fold cross-validation is used to exploit 
the labeled data both for model selection (selection of the free parameters of the 
algorithm) and for training. 

Let $\theta$ denote the vector of free parameters of the algorithm. For a fixed value
of $\theta$, the method consists of first randomly partitioning a given sample $S$ of
$m$ labeled examples into $n$ subsamples, or folds. The $i$th fold is thus a labeled
sample $((x_{i1}, y_{i1}), \ldots, (x_{im_i}, y_{im_i}))$ of size $m_i$. Then, for any $i \in [1, n]$, the learning
algorithm is trained on all but the $i$th fold to generate a hypothesis $h_i$, and the
performance of $h_i$ is tested on the $i$th fold, as illustrated in figure 1.2a. The
parameter value $\theta$ is evaluated based on the average error of the hypotheses $h_i$,
which is called the cross-validation error. This quantity is denoted by $\widehat{R}_{CV}(\theta)$ and
defined by

$$\widehat{R}_{CV}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L(h_i(x_{ij}), y_{ij})}_{\text{error of } h_i \text{ on the } i\text{th fold}}.$$

The folds are generally chosen to have equal size, that is $m_i = m/n$ for all $i \in [1, n]$.
How should n be chosen? The appropriate choice is subject to a trade-off and the
topic of much learning theory research that we cannot address in this introductory
chapter. For a large n, each training sample used in n-fold cross-validation has size
$m - m/n = m(1 - 1/n)$ (illustrated by the right vertical red line in figure 1.2b), which
is close to m, the size of the full sample, but the training samples are quite similar.
Thus, the method tends to have a small bias but a large variance. In contrast,
smaller values of n lead to more diverse training samples but their size (shown by
the left vertical red line in figure 1.2b) is significantly less than m, thus the method
tends to have a smaller variance but a larger bias.

Figure 1.2 n-fold cross validation. (a) Illustration of the partitioning of the
training data into 5 folds. (b) Typical plot of a classifier’s prediction error as a
function of the size of the training sample: the error decreases as a function of the
number of training points.

In machine learning applications, n is typically chosen to be 5 or 10. n-fold cross
validation is used as follows in model selection. The full labeled data is first split
into a training and a test sample. The training sample of size m is then used to
compute the n-fold cross-validation error $\widehat{R}_{CV}(\theta)$ for a small number of possible
values of $\theta$. $\theta$ is next set to the value $\theta_0$ for which $\widehat{R}_{CV}(\theta)$ is smallest and the
algorithm is trained with the parameter setting $\theta_0$ over the full training sample of
size m. Its performance is evaluated on the test sample as already described in the
previous section.
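
A minimal sketch of this model selection procedure is given below. The function `cross_validation_error` follows the definition of the cross-validation error (average over folds of the average loss on the held-out fold); the learner `train_shrunken_mean`, the candidate values of the parameter, and the synthetic data are hypothetical and serve only to make the example runnable.

```python
import random


def cross_validation_error(train_fn, loss, sample, n, theta, seed=0):
    """n-fold cross-validation error of the parameter setting theta."""
    shuffled = sample[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n] for i in range(n)]  # n folds of nearly equal size
    fold_errors = []
    for i in range(n):
        held_out = folds[i]
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_i = train_fn(rest, theta)  # trained on all but the ith fold
        fold_errors.append(
            sum(loss(h_i(x), y) for x, y in held_out) / len(held_out)
        )
    return sum(fold_errors) / n


def train_shrunken_mean(sample, theta):
    """Hypothetical learner: a constant predictor shrunk toward 0 by theta."""
    c = sum(y for _, y in sample) / (len(sample) + theta)
    return lambda x: c


def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2


rng = random.Random(1)
training_sample = [(x, 2.0 + rng.gauss(0, 0.1)) for x in range(50)]

# Evaluate a small number of candidate values and keep the one with the
# smallest cross-validation error.
thetas = [0.0, 1.0, 10.0]
theta_0 = min(
    thetas,
    key=lambda t: cross_validation_error(
        train_shrunken_mean, squared_loss, training_sample, 5, t
    ),
)

# Retrain with the selected setting over the full training sample.
h = train_shrunken_mean(training_sample, theta_0)
print("selected theta:", theta_0)
```

The same folds are reused for every candidate value because the shuffle is seeded, so the candidates are compared on identical partitions.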

The special case of n-fold cross validation where n = m is called leave-one-out 
cross-validation, since at each iteration exactly one instance is left out of the training 
sample. As shown in chapter 4, the average leave-one-out error is an approximately 
unbiased estimate of the average error of an algorithm and can be used to derive 
simple guarantees for some algorithms. In general, the leave-one-out error is very
costly to compute, since it requires training m times on samples of size $m - 1$, but
for some algorithms it admits a very efficient computation (see exercise 10.9).

In addition to model selection, n-fold cross validation is also commonly used for
performance evaluation. In that case, for a fixed parameter setting $\theta$, the full labeled
sample is divided into n random folds with no distinction between training and test
samples. The performance reported is the n-fold cross-validation error on the full
sample as well as the standard deviation of the errors measured on each fold.


1.4 Learning scenarios


We next briefly describe common machine learning scenarios. These scenarios differ 
in the types of training data available to the learner, the order and method by which 
training data is received and the test data used to evaluate the learning algorithm. 


= Supervised learning: The learner receives a set of labeled examples as training 
data and makes predictions for all unseen points. This is the most common scenario 
associated with classification, regression, and ranking problems. The spam detection 
problem discussed in the previous section is an instance of supervised learning. 


= Unsupervised learning: The learner exclusively receives unlabeled training data,
and makes predictions for all unseen points. Since in general no labeled example
is available in that setting, it can be difficult to quantitatively evaluate the
performance of a learner. Clustering and dimensionality reduction are examples of
unsupervised learning problems.


= Semi-supervised learning: The learner receives a training sample consisting of
both labeled and unlabeled data, and makes predictions for all unseen points.
Semi-supervised learning is common in settings where unlabeled data is easily accessible
but labels are expensive to obtain. Various types of problems arising in applications,
including classification, regression, or ranking tasks, can be framed as instances
of semi-supervised learning. The hope is that the distribution of unlabeled data
accessible to the learner can help him achieve a better performance than in the
supervised setting. The analysis of the conditions under which this can indeed
be realized is the topic of much modern theoretical and applied machine learning
research.


= Transductive inference: As in the semi-supervised scenario, the learner receives 
a labeled training sample along with a set of unlabeled test points. However, the 
objective of transductive inference is to predict labels only for these particular test 
points. Transductive inference appears to be an easier task and matches the scenario 
encountered in a variety of modern applications. However, as in the semi-supervised 
setting, the assumptions under which a better performance can be achieved in this 
setting are research questions that have not been fully resolved. 


= On-line learning: In contrast with the previous scenarios, the on-line scenario
involves multiple rounds, and the training and testing phases are intermixed. At each
round, the learner receives an unlabeled training point, makes a prediction, receives
the true label, and incurs a loss. The objective in the on-line setting is to minimize
the cumulative loss over all rounds. Unlike the settings just discussed, no
distributional assumption is made in on-line learning. In fact, instances and their
labels may be chosen adversarially within this scenario.

= Reinforcement learning: The training and testing phases are also intermixed in
reinforcement learning. To collect information, the learner actively interacts with the
environment and in some cases affects the environment, and receives an immediate
reward for each action. The objective of the learner is to maximize his reward over
a course of actions and iterations with the environment. However, no long-term
reward feedback is provided by the environment, and the learner is faced with the
exploration versus exploitation dilemma, since he must choose between exploring
unknown actions to gain more information versus exploiting the information already
collected.


= Active learning: The learner adaptively or interactively collects training examples, 
typically by querying an oracle to request labels for new points. The goal in 
active learning is to achieve a performance comparable to the standard supervised 
learning scenario, but with fewer labeled examples. Active learning is often used 
in applications where labels are expensive to obtain, for example computational 
biology applications. 


In practice, many other intermediate and somewhat more complex learning scenarios 
may be encountered. 
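
To make the on-line protocol described in the list above concrete, here is a minimal sketch of its round structure: at each round the learner predicts on an unlabeled point, the true label is revealed, a loss is incurred, and the learner may update. The `MajorityLearner` class and the short sequence of rounds are hypothetical illustrations, not an algorithm from the text.

```python
def run_online(predict, update, rounds):
    """Run the on-line protocol and return the cumulative zero-one loss."""
    cumulative_loss = 0
    for x, y in rounds:
        y_hat = predict(x)                   # prediction on the unlabeled point
        cumulative_loss += int(y_hat != y)   # true label revealed, loss incurred
        update(x, y)                         # learner adapts before the next round
    return cumulative_loss


class MajorityLearner:
    """Hypothetical learner: predicts the label seen most often so far."""

    def __init__(self):
        self.counts = {+1: 0, -1: 0}

    def predict(self, x):
        return +1 if self.counts[+1] >= self.counts[-1] else -1

    def update(self, x, y):
        self.counts[y] += 1


learner = MajorityLearner()
rounds = [(0, -1), (1, -1), (2, +1), (3, -1)]
total = run_online(learner.predict, learner.update, rounds)
print(total)  # 2: mistakes on the first and third rounds
```

Nothing in `run_online` assumes that the rounds are drawn from a distribution; the sequence could equally be chosen by an adversary, which is the point of the on-line setting.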


1.5 Outline 


This book presents several fundamental and mathematically well-studied algorithms.
It discusses in depth their theoretical foundations as well as their practical
applications. The topics covered include:


= Probably approximately correct (PAC) learning framework; learning guarantees
for finite hypothesis sets;

= Learning guarantees for infinite hypothesis sets, Rademacher complexity,
VC-dimension;

= Support vector machines (SVMs), margin theory;

= Kernel methods, positive definite symmetric kernels, representer theorem, rational
kernels;

= Boosting, analysis of empirical error, generalization error, margin bounds;

= Online learning, mistake bounds, the weighted majority algorithm, the exponential
weighted average algorithm, the Perceptron and Winnow algorithms;

= Multi-class classification, multi-class SVMs, multi-class boosting, one-versus-all,
one-versus-one, error-correction methods;

= Ranking, ranking with SVMs, RankBoost, bipartite ranking, preference-based
ranking;

= Regression, linear regression, kernel ridge regression, support vector regression,
Lasso;

= Stability-based analysis, applications to classification and regression;

= Dimensionality reduction, principal component analysis (PCA), kernel PCA,
Johnson-Lindenstrauss lemma;

= Learning automata and languages;

= Reinforcement learning, Markov decision processes, planning and learning problems.


The analyses in this book are self-contained, with relevant mathematical concepts 
related to linear algebra, convex optimization, probability and statistics included in 
the appendix. 


2 The PAC Learning Framework


Several fundamental questions arise when designing and analyzing algorithms that 
learn from examples: What can be learned efficiently? What is inherently hard to 
learn? How many examples are needed to learn successfully? Is there a general model 
of learning? In this chapter, we begin to formalize and address these questions by 
introducing the Probably Approximately Correct (PAC) learning framework. The 
PAC framework helps define the class of learnable concepts in terms of the number
of sample points needed to achieve an approximate solution (the sample complexity)
and the time and space complexity of the learning algorithm, which depends on the
cost of the computational representation of the concepts.

We first describe the PAC framework and illustrate it, then present some general 
learning guarantees within this framework when the hypothesis set used is finite, 
both for the consistent case where the hypothesis set used contains the concept to 
learn and for the opposite inconsistent case. 


2.1 The PAC learning model 


We first introduce several definitions and the notation needed to present the PAC 
model, which will also be used throughout much of this book. 

We denote by $\mathcal{X}$ the set of all possible examples or instances. $\mathcal{X}$ is also sometimes
referred to as the input space. The set of all possible labels or target values is denoted
by $\mathcal{Y}$. For the purpose of this introductory chapter, we will limit ourselves to the
case where $\mathcal{Y}$ is reduced to two labels, $\mathcal{Y} = \{0, 1\}$, so-called binary classification.
Later chapters will extend these results to more general settings.

A concept $c\colon \mathcal{X} \to \mathcal{Y}$ is a mapping from $\mathcal{X}$ to $\mathcal{Y}$. Since $\mathcal{Y} = \{0, 1\}$, we can identify
$c$ with the subset of $\mathcal{X}$ over which it takes the value 1. Thus, in the following, we
equivalently refer to a concept to learn as a mapping from $\mathcal{X}$ to $\{0, 1\}$, or to a
subset of $\mathcal{X}$. As an example, a concept may be the set of points inside a triangle or
the indicator function of these points. In such cases, we will say in short that the
concept to learn is a triangle. A concept class is a set of concepts we may wish to
learn and is denoted by $C$. This could, for example, be the set of all triangles in the
plane.

We assume that examples are independently and identically distributed (i.i.d.)
according to some fixed but unknown distribution $D$. The learning problem is then
formulated as follows. The learner considers a fixed set of possible concepts $H$,
called a hypothesis set, which may not coincide with $C$. He receives a sample
$S = (x_1, \ldots, x_m)$ drawn i.i.d. according to $D$ as well as the labels $(c(x_1), \ldots, c(x_m))$,
which are based on a specific target concept $c \in C$ to learn. His task is to use the
labeled sample $S$ to select a hypothesis $h_S \in H$ that has a small generalization
error with respect to the concept $c$. The generalization error of a hypothesis $h \in H$,
also referred to as the true error or just error of $h$, is denoted by $R(h)$ and defined
as follows.¹


Definition 2.1 Generalization error
Given a hypothesis $h \in H$, a target concept $c \in C$, and an underlying distribution
$D$, the generalization error or risk of $h$ is defined by

$$R(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big], \tag{2.1}$$

where $1_\omega$ is the indicator function of the event $\omega$.²


The generalization error of a hypothesis is not directly accessible to the learner 
since both the distribution D and the target concept c are unknown. However, the 
learner can measure the empirical error of a hypothesis on the labeled sample S. 


Definition 2.2 Empirical error
Given a hypothesis $h \in H$, a target concept $c \in C$, and a sample $S = (x_1, \ldots, x_m)$,
the empirical error or empirical risk of $h$ is defined by

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}. \tag{2.2}$$


1. The choice of R instead of E to denote an error avoids possible confusion with the
notation for expectations and is further justified by the fact that the term risk is also used
in machine learning and statistics to refer to an error.

2. For this and other related definitions, the family of functions H and the target concept
c must be measurable. The function classes we consider in this book all have this property.


Thus, the empirical error of $h \in H$ is its average error over the sample $S$, while the
generalization error is its expected error based on the distribution $D$. We will see in
this chapter and the following chapters a number of guarantees relating these two
quantities with high probability, under some general assumptions. We can already
note that for a fixed $h \in H$, the expectation of the empirical error based on an i.i.d.
sample $S$ is equal to the generalization error:

$$\mathbb{E}_{S \sim D^m}\big[\widehat{R}(h)\big] = R(h). \tag{2.3}$$


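Indeed, by the linearity of the expectation and the fact that the sample is drawn
i.i.d., we can write

$$\mathbb{E}_{S \sim D^m}\big[\widehat{R}(h)\big] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1_{h(x_i) \neq c(x_i)}\big] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big],$$

for any $x$ in sample $S$. Thus,

$$\mathbb{E}_{S \sim D^m}\big[\widehat{R}(h)\big] = \mathbb{E}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big] = R(h).$$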
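
The identity (2.3) can be checked numerically on a toy problem. The sketch below is an illustration under assumptions that are not part of the text: the distribution is uniform on [0, 1], the target concept and the hypothesis are threshold functions, and the generalization error is therefore the length of the interval on which they disagree.

```python
import random


def c(x):
    """Hypothetical target concept: label 1 when x >= 0.5."""
    return int(x >= 0.5)


def h(x):
    """Hypothetical fixed hypothesis: label 1 when x >= 0.3."""
    return int(x >= 0.3)


def empirical_error(h, c, sample):
    """Fraction of sample points on which h and c disagree."""
    return sum(h(x) != c(x) for x in sample) / len(sample)


# Under the uniform distribution on [0, 1], h and c disagree exactly on
# [0.3, 0.5), so the generalization error is 0.2.
true_error = 0.2

rng = random.Random(0)
m, trials = 20, 5000
average_empirical_error = sum(
    empirical_error(h, c, [rng.random() for _ in range(m)])
    for _ in range(trials)
) / trials

print(abs(average_empirical_error - true_error) < 0.01)
```

Each individual empirical error computed on only 20 points can be far from 0.2; it is the average over many independent samples that concentrates around the generalization error.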


The following introduces the Probably Approximately Correct (PAC) learning
framework. We denote by $O(n)$ an upper bound on the cost of the computational
representation of any element $x \in \mathcal{X}$ and by $size(c)$ the maximal cost of the
computational representation of $c \in C$. For example, $x$ may be a vector in $\mathbb{R}^n$,
for which the cost of an array-based representation would be in $O(n)$.


Definition 2.3 PAC-learning
A concept class $C$ is said to be PAC-learnable if there exists an algorithm $\mathcal{A}$ and
a polynomial function $poly(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all
distributions $D$ on $\mathcal{X}$ and for any target concept $c \in C$, the following holds for any
sample size $m \geq poly(1/\epsilon, 1/\delta, n, size(c))$:

$$\Pr_{S \sim D^m}\big[R(h_S) \leq \epsilon\big] \geq 1 - \delta. \tag{2.4}$$

If $\mathcal{A}$ further runs in $poly(1/\epsilon, 1/\delta, n, size(c))$, then $C$ is said to be efficiently
PAC-learnable. When such an algorithm $\mathcal{A}$ exists, it is called a PAC-learning algorithm
for $C$.


A concept class $C$ is thus PAC-learnable if the hypothesis returned by the algorithm
after observing a number of points polynomial in $1/\epsilon$ and $1/\delta$ is approximately
correct (error at most $\epsilon$) with high probability (at least $1 - \delta$), which justifies the
PAC terminology. $\delta > 0$ is used to define the confidence $1 - \delta$ and $\epsilon > 0$ the accuracy
$1 - \epsilon$. Note that if the running time of the algorithm is polynomial in $1/\epsilon$ and $1/\delta$,
then the sample size $m$ must also be polynomial if the full sample is received by the
algorithm.

Several key points of the PAC definition are worth emphasizing. First, the PAC 
framework is a distribution-free model : no particular assumption is made about the 
distribution D from which examples are drawn. Second, the training sample and the 
test examples used to define the error are drawn according to the same distribution 
D. This is a necessary assumption for generalization to be possible in most cases. 
Figure 2.1 Target concept R and possible hypothesis R’. Circles represent training
instances. A blue circle is a point labeled with 1, since it falls within the rectangle 
R. Others are red and labeled with 0. 


Finally, the PAC framework deals with the question of learnability for a concept 
class $C$ and not a particular concept. Note that the concept class $C$ is known to the
algorithm, but of course the target concept $c \in C$ is unknown.

In many cases, in particular when the computational representation of the con- 
cepts is not explicitly discussed or is straightforward, we may omit the polynomial 
dependency on n and size(c) in the PAC definition and focus only on the sample 
complexity. 

We now illustrate PAC-learning with a specific learning problem. 


Example 2.1 Learning axis-aligned rectangles 

Consider the case where the set of instances are points in the plane, $X = \mathbb{R}^2$, and
the concept class $C$ is the set of all axis-aligned rectangles lying in $\mathbb{R}^2$. Thus, each
concept c is the set of points inside a particular axis-aligned rectangle. The learning 
problem consists of determining with small error a target axis-aligned rectangle 
using the labeled training sample. We will show that the concept class of axis- 
aligned rectangles is PAC-learnable. 

Figure 2.1 illustrates the problem. R represents a target axis-aligned rectangle 
and R’ a hypothesis. As can be seen from the figure, the error regions of R’ are 
formed by the area within the rectangle R but outside the rectangle R’ and the area 
within R’ but outside the rectangle R. The first area corresponds to false negatives, 
that is, points that are labeled as 0 or negatively by R’, which are in fact positive 
or labeled with 1. The second area corresponds to false positives, that is, points 
labeled positively by R’ which are in fact negatively labeled. 

To show that the concept class is PAC-learnable, we describe a simple PAC-learning
algorithm $A$. Given a labeled sample $S$, the algorithm consists of returning
the tightest axis-aligned rectangle $R' = R_S$ containing the points labeled with 1.
Figure 2.2 illustrates the hypothesis returned by the algorithm. By definition, $R_S$
does not produce any false positive, since its points must be included in the target
concept $R$. Thus, the error region of $R_S$ is included in $R$.
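
The algorithm just described can be sketched in a few lines of Python. This is a minimal illustration rather than code from the text: the names `tightest_rectangle` and `predict`, and the representation of a rectangle as a tuple `(x_min, x_max, y_min, y_max)`, are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max), hypothetical layout


def tightest_rectangle(sample: List[Tuple[Point, int]]) -> Optional[Rect]:
    """Return the smallest axis-aligned rectangle containing all points labeled 1.

    Returns None when the sample contains no positive point; the resulting
    hypothesis then labels every point 0.
    """
    positives = [p for p, label in sample if label == 1]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def predict(rect: Optional[Rect], point: Point) -> int:
    """Label a point 1 if it lies inside the closed rectangle, and 0 otherwise."""
    if rect is None:
        return 0
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return int(x_min <= x <= x_max and y_min <= y <= y_max)


# Small labeled sample: two positive points and one negative point.
sample = [((0.5, 0.5), 1), ((1.5, 0.2), 1), ((3.0, 3.0), 0)]
rect = tightest_rectangle(sample)
```

Because the returned rectangle is the bounding box of the positive points, it labels every training point correctly whenever the labels come from some axis-aligned rectangle.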
Figure 2.2 Illustration of the hypothesis $R' = R_S$ returned by the algorithm.


Let $R \in C$ be a target concept. Fix $\epsilon > 0$. Let $\Pr[R]$ denote the probability mass
of the region defined by $R$, that is, the probability that a point randomly drawn
according to $D$ falls within $R$. Since errors made by our algorithm can be due only
to points falling inside $R$, we can assume that $\Pr[R] > \epsilon$; otherwise, the error of
$R_S$ is less than or equal to $\epsilon$ regardless of the training sample $S$ received.

Now, since $\Pr[R] > \epsilon$, we can define four rectangular regions $r_1, r_2, r_3$, and $r_4$
along the sides of $R$, each with probability at least $\epsilon/4$. These regions can be
constructed by starting with the empty rectangle along a side and increasing its
size until its distribution mass is at least $\epsilon/4$. Figure 2.3 illustrates the definition of
these regions.

Observe that if $R_S$ meets all of these four regions, then, because it is a rectangle,
it will have one side in each of these four regions (geometric argument). Its error
area, which is the part of $R$ that it does not cover, is thus included in these regions
and cannot have probability mass more than $\epsilon$. By contraposition, if $R(R_S) > \epsilon$,
then $R_S$ must miss at least one of the regions $r_i$, $i \in [1, 4]$. As a result, we can write


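$$\begin{aligned}
\Pr_{S \sim D^m}[R(R_S) > \epsilon] &\le \Pr_{S \sim D^m}\Big[\bigcup_{i=1}^{4} \{R_S \cap r_i = \emptyset\}\Big] \\
&\le \sum_{i=1}^{4} \Pr_{S \sim D^m}\big[\{R_S \cap r_i = \emptyset\}\big] && \text{(by the union bound)} \\
&\le 4(1 - \epsilon/4)^m && \text{(since } \Pr[r_i] \ge \epsilon/4\text{)} \\
&\le 4\exp(-m\epsilon/4),
\end{aligned} \tag{2.5}$$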


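where for the last step we used the general inequality $1 - x \le e^{-x}$, valid for all
$x \in \mathbb{R}$. For any $\delta > 0$, to ensure that $\Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \delta$, we can impose


$$4\exp(-\epsilon m/4) \le \delta \iff m \ge \frac{4}{\epsilon}\log\frac{4}{\delta}. \tag{2.6}$$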


Thus, for any $\epsilon > 0$ and $\delta > 0$, if the sample size $m$ is greater than
$\frac{4}{\epsilon}\log\frac{4}{\delta}$, then $\Pr_{S \sim D^m}[R(R_S) > \epsilon] \le \delta$. Furthermore, the computational
cost of the representation of points in $\mathbb{R}^2$ and axis-aligned rectangles, which can be
defined by their four corners, is constant. This proves that the concept class of
axis-aligned rectangles is PAC-learnable and that the sample complexity of
PAC-learning axis-aligned rectangles is in $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$.

Figure 2.3 Illustration of the regions $r_1, \ldots, r_4$.

An equivalent way to present sample complexity results like (2.6), which we will 
often see throughout this book, is to give a generalization bound. It states that with 
probability at least $1 - \delta$, $R(R_S)$ is upper bounded by some quantity that depends
on the sample size $m$ and $\delta$. To obtain this, it suffices to set $\delta$ to be equal to the
upper bound derived in (2.5), that is $\delta = 4\exp(-m\epsilon/4)$, and solve for $\epsilon$. This yields
that with probability at least $1 - \delta$, the error of the algorithm is bounded as:


$$R(R_S) \le \frac{4}{m}\log\frac{4}{\delta}. \tag{2.7}$$
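
As a numerical illustration, the sample complexity (2.6) and the generalization bound (2.7) can be evaluated directly. The sketch below is not from the text; the function names and the values $\epsilon = 0.1$, $\delta = 0.05$ are arbitrary choices made for the example.

```python
import math


def rectangle_sample_size(eps: float, delta: float) -> int:
    """Smallest integer m with m >= (4 / eps) * log(4 / delta), as in (2.6)."""
    return math.ceil(4.0 / eps * math.log(4.0 / delta))


def rectangle_error_bound(m: int, delta: float) -> float:
    """Right-hand side of (2.7): (4 / m) * log(4 / delta)."""
    return 4.0 / m * math.log(4.0 / delta)


# Sample size sufficient for error at most 0.1 with confidence at least 95%.
m = rectangle_sample_size(eps=0.1, delta=0.05)
# Plugging that sample size back into (2.7) gives a bound no larger than eps.
bound = rectangle_error_bound(m, delta=0.05)
```

The two functions are inverse views of the same inequality: one fixes the accuracy and returns a sample size, the other fixes the sample size and returns the guaranteed accuracy.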


Other PAC-learning algorithms could be considered for this example. One alterna- 
tive is to return the largest axis-aligned rectangle not containing the negative points, 
for example. The proof of PAC-learning just presented for the tightest axis-aligned 
rectangle can be easily adapted to the analysis of other such algorithms. 


Note that the hypothesis set H we considered in this example coincided with the 
concept class C and that its cardinality was infinite. Nevertheless, the problem 
admitted a simple proof of PAC-learning. We may then ask if a similar proof 
can readily apply to other similar concept classes. This is not as straightforward 
because the specific geometric argument used in the proof is key. It is non-trivial 
to extend the proof to other concept classes such as that of non-concentric circles 
(see exercise 2.4). Thus, we need a more general proof technique and more general 
results. The next two sections provide us with such tools in the case of a finite 
hypothesis set. 



2.2 Guarantees for finite hypothesis sets — consistent case 


In the example of axis-aligned rectangles that we examined, the hypothesis $h_S$
returned by the algorithm was always consistent, that is, it admitted no error on
the training sample $S$. In this section, we present a general sample complexity
bound, or equivalently, a generalization bound, for consistent hypotheses, in the 
case where the cardinality |H| of the hypothesis set is finite. Since we consider 
consistent hypotheses, we will assume that the target concept c is in H. 


Theorem 2.1 Learning bounds — finite H, consistent case 
Let $H$ be a finite set of functions mapping from $X$ to $Y$. Let $A$ be an algorithm that
for any target concept $c \in H$ and i.i.d. sample $S$ returns a consistent hypothesis $h_S$:
$\hat{R}(h_S) = 0$. Then, for any $\epsilon, \delta > 0$, the inequality $\Pr_{S \sim D^m}[R(h_S) \le \epsilon] \ge 1 - \delta$
holds if

$$m \ge \frac{1}{\epsilon}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.8}$$

This sample complexity result admits the following equivalent statement as a
generalization bound: for any $\epsilon, \delta > 0$, with probability at least $1 - \delta$,

$$R(h_S) \le \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.9}$$


Proof Fix $\epsilon > 0$. We do not know which consistent hypothesis $h_S \in H$ is selected
by the algorithm $A$. This hypothesis further depends on the training sample $S$.
Therefore, we need to give a uniform convergence bound, that is, a bound that
holds for the set of all consistent hypotheses, which a fortiori includes $h_S$. Thus,
we will bound the probability that some $h \in H$ would be consistent and have error
more than $\epsilon$:

$$\begin{aligned}
&\Pr\big[\exists h \in H \colon \hat{R}(h) = 0 \wedge R(h) > \epsilon\big] \\
&\quad = \Pr\big[(h_1 \in H, \hat{R}(h_1) = 0 \wedge R(h_1) > \epsilon) \vee (h_2 \in H, \hat{R}(h_2) = 0 \wedge R(h_2) > \epsilon) \vee \cdots\big] \\
&\quad \le \sum_{h \in H} \Pr\big[\hat{R}(h) = 0 \wedge R(h) > \epsilon\big] && \text{(union bound)} \\
&\quad \le \sum_{h \in H} \Pr\big[\hat{R}(h) = 0 \mid R(h) > \epsilon\big]. && \text{(definition of conditional probability)}
\end{aligned}$$

Now, consider any hypothesis $h \in H$ with $R(h) > \epsilon$. Then, the probability that $h$
would be consistent on a training sample $S$ drawn i.i.d., that is, that it would have
no error on any point in $S$, can be bounded as:

$$\Pr\big[\hat{R}(h) = 0 \mid R(h) > \epsilon\big] \le (1 - \epsilon)^m.$$

The previous inequality implies

$$\Pr\big[\exists h \in H \colon \hat{R}(h) = 0 \wedge R(h) > \epsilon\big] \le |H|(1 - \epsilon)^m.$$

Since $(1 - \epsilon)^m \le e^{-m\epsilon}$, the right-hand side is at most $|H| e^{-m\epsilon}$. Setting this
quantity to be equal to $\delta$ and solving for $\epsilon$ concludes the proof. ∎


The theorem shows that when the hypothesis set H is finite, a consistent algorithm 
A is a PAC-learning algorithm, since the sample complexity given by (2.8) is 
dominated by a polynomial in $1/\epsilon$ and $1/\delta$. As shown by (2.9), the generalization
error of consistent hypotheses is upper bounded by a term that decreases as 
a function of the sample size m. This is a general fact: as expected, learning 
algorithms benefit from larger labeled training samples. The decrease rate of O(1/m) 
guaranteed by this theorem, however, is particularly favorable. 

The price to pay for coming up with a consistent algorithm is the use of a 
larger hypothesis set H containing target concepts. Of course, the upper bound 
(2.9) increases with |H|. However, that dependency is only logarithmic. Note that 
the term $\log|H|$, or the related term $\log_2|H|$ from which it differs by a constant
factor, can be interpreted as the number of bits needed to represent $H$. Thus, the
generalization guarantee of the theorem is controlled by the ratio of this number of
bits, $\log_2|H|$, and the sample size $m$.

We now use theorem 2.1 to analyze PAC-learning with various concept classes. 


Example 2.2 Conjunction of Boolean literals 

Consider learning the concept class $C_n$ of conjunctions of at most $n$ Boolean literals
$x_1, \ldots, x_n$. A Boolean literal is either a variable $x_i$, $i \in [1, n]$, or its negation
$\bar{x}_i$. For $n = 4$, an example is the conjunction $x_1 \wedge \bar{x}_2 \wedge x_4$, where $\bar{x}_2$ denotes the
negation of the Boolean literal $x_2$. $(1, 0, 0, 1)$ is a positive example for this concept
while $(1, 0, 0, 0)$ is a negative example.

Observe that for $n = 4$, a positive example $(1, 0, 1, 0)$ implies that the target
concept cannot contain the literals $\bar{x}_1$ and $\bar{x}_3$ and that it cannot contain the literals
$x_2$ and $x_4$. In contrast, a negative example is not as informative since it is not
known which of its $n$ bits are incorrect. A simple algorithm for finding a consistent
hypothesis is thus based on positive examples and consists of the following: for each
positive example $(b_1, \ldots, b_n)$ and $i \in [1, n]$, if $b_i = 1$ then $\bar{x}_i$ is ruled out as a
possible literal in the concept class and if $b_i = 0$ then $x_i$ is ruled out. The conjunction
of all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4
shows an example training sample as well as a consistent hypothesis for the case 
n=6. 
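
The elimination procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the text: a literal is encoded as a hypothetical pair `(i, positive)` with a 0-based variable index, and the names `learn_conjunction` and `evaluate` are chosen for the example.

```python
from typing import List, Set, Tuple

Literal = Tuple[int, bool]              # (variable index, True for x_i, False for its negation)
Example = Tuple[Tuple[int, ...], int]   # (bit vector, label in {0, 1})


def learn_conjunction(sample: List[Example], n: int) -> Set[Literal]:
    """Start from all 2n literals and rule out those contradicted by a positive example."""
    literals = {(i, positive) for i in range(n) for positive in (True, False)}
    for bits, label in sample:
        if label != 1:
            continue  # negative examples are not used by this algorithm
        for i, b in enumerate(bits):
            # b == 1 rules out the negated literal; b == 0 rules out x_i itself.
            literals.discard((i, b == 0))
    return literals


def evaluate(literals: Set[Literal], bits: Tuple[int, ...]) -> int:
    """Return 1 if every remaining literal is satisfied by the bit vector, else 0."""
    return int(all((bits[i] == 1) == positive for i, positive in literals))


# Target concept x_1 AND (NOT x_2) AND x_4 for n = 4, written with 0-based indices.
sample = [
    ((1, 0, 0, 1), 1),
    ((1, 0, 1, 1), 1),
    ((1, 0, 0, 0), 0),
    ((0, 0, 0, 1), 0),
]
hypothesis = learn_conjunction(sample, n=4)
```

Each positive example is processed in time proportional to `n`, and the returned conjunction agrees with every example in the sample when the labels come from a conjunction of literals.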

We have $|H| = |C_n| = 3^n$, since each literal can be included positively, with
negation, or not included. Plugging this into the sample complexity bound for
consistent hypotheses yields the following sample complexity bound for any $\epsilon > 0$
and $\delta > 0$:

$$m \ge \frac{1}{\epsilon}\Big((\log 3)\, n + \log\frac{1}{\delta}\Big). \tag{2.10}$$

Figure 2.4 Each of the first six rows of the table represents a training example with
its label, + or −, indicated in the last column. The last row contains 0 (respectively
1) in column $i \in [1, 6]$ if the $i$th entry is 0 (respectively 1) for all the positive examples.
It contains “?” if both 0 and 1 appear as an $i$th entry for some positive example.
Thus, for this training sample, the hypothesis returned by the consistent algorithm
described in the text is $\bar{x}_1 \wedge x_2 \wedge x_5 \wedge x_6$.


Thus, the class of conjunctions of at most $n$ Boolean literals is PAC-learnable. Note
that the computational complexity is also polynomial, since the training cost per
example is in $O(n)$. For $\delta = 0.02$, $\epsilon = 0.1$, and $n = 10$, the bound becomes $m \ge 149$.
Thus, for a labeled sample of at least 149 examples, the bound guarantees 90%
accuracy with a confidence of at least 98%.
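
The arithmetic behind this figure can be checked with a short script. The sketch below simply evaluates the sample complexity bound for consistent hypotheses with $|H| = 3^n$; the function name `consistent_sample_size` is an assumption made for the example.

```python
import math


def consistent_sample_size(log_card_h: float, eps: float, delta: float) -> int:
    """Smallest integer m with m >= (log|H| + log(1/delta)) / eps."""
    return math.ceil((log_card_h + math.log(1.0 / delta)) / eps)


# Conjunctions of at most n Boolean literals: log|H| = n * log(3).
n = 10
m_conjunctions = consistent_sample_size(n * math.log(3), eps=0.1, delta=0.02)

# Doubling n roughly doubles the dominant term, since log|H| is linear in n.
m_conjunctions_20 = consistent_sample_size(20 * math.log(3), eps=0.1, delta=0.02)
```

With these inputs the first call evaluates to 149, matching the value quoted above.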


Example 2.3 Universal concept class 

Consider the set $X = \{0, 1\}^n$ of all Boolean vectors with $n$ components, and let $U_n$
be the concept class formed by all subsets of $X$. Is this concept class PAC-learnable?
To guarantee a consistent hypothesis the hypothesis class must include the concept
class, thus $|H| \ge |U_n| = 2^{(2^n)}$. Theorem 2.1 gives the following sample complexity
bound:

$$m \ge \frac{1}{\epsilon}\Big((\log 2)\, 2^n + \log\frac{1}{\delta}\Big). \tag{2.11}$$

Here, the number of training samples required is exponential in $n$, which is the cost
of the representation of a point in $X$. Thus, PAC-learning is not guaranteed by
the theorem. In fact, it is not hard to show that this universal concept class is not
PAC-learnable.
Example 2.4 k-term DNF formulae

A disjunctive normal form (DNF) formula is a formula written as the disjunction of
several terms, each term being a conjunction of Boolean literals. A k-term DNF is a
DNF formula defined by the disjunction of $k$ terms, each term being a conjunction
of at most $n$ Boolean literals. Thus, for $k = 2$ and $n = 3$, an example of a k-term
DNF is $(x_1 \wedge \bar{x}_2 \wedge x_3) \vee (\bar{x}_1 \wedge x_3)$.

Is the class $C$ of k-term DNF formulae PAC-learnable? The cardinality of the
class is $3^{nk}$, since each term is a conjunction of at most $n$ variables and there are
$3^n$ such conjunctions, as seen previously. The hypothesis set $H$ must contain $C$ for
consistency to be possible, thus $|H| \ge 3^{nk}$. Theorem 2.1 gives the following sample
complexity bound:

$$m \ge \frac{1}{\epsilon}\Big((\log 3)\, nk + \log\frac{1}{\delta}\Big), \tag{2.12}$$

which is polynomial. However, it can be shown that an efficient algorithm for
learning k-term DNF formulae, with hypotheses that are themselves k-term DNF
formulae, would place an NP-complete problem in RP, the complexity class of
problems that admit a randomized polynomial-time decision solution. The problem is
therefore computationally intractable unless RP = NP, which is commonly conjectured
not to be the case. Thus, while the sample size needed for learning k-term DNF
formulae is only polynomial, efficient PAC-learning of this class is not possible
unless RP = NP.


Example 2.5 k-CNF formulae 

A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k- 
CNF formula is an expression of the form $T_1 \wedge \cdots \wedge T_j$ with arbitrary length $j \in \mathbb{N}$
and with each term $T_i$ being a disjunction of at most $k$ Boolean attributes.

The problem of learning k-CNF formulae can be reduced to that of learning 
conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable 
concept class. To do so, it suffices to associate to each term $T_i$ a new variable.
Then, this can be done with the following bijection:

$$a_i(x_1) \vee \cdots \vee a_i(x_n) \to Y_{a_i(x_1), \ldots, a_i(x_n)}, \tag{2.13}$$

where $a_i(x_j)$ denotes the assignment to $x_j$ in term $T_i$. This reduction to
PAC-learning of conjunctions of Boolean literals may affect the original distribution, but
this is not an issue since in the PAC framework no assumption is made about the 
distribution. Thus, the PAC-learnability of conjunctions of Boolean literals implies 
that of k-CNF formulae. 

This is a surprising result, however, since any k-term DNF formula can be written
as a k-CNF formula. Indeed, using distributivity, a k-term DNF can be rewritten as
a k-CNF formula via

$$\bigvee_{i=1}^{k} a_i(x_1) \wedge \cdots \wedge a_i(x_n) = \bigwedge_{i_1, \ldots, i_k = 1}^{n} a_1(x_{i_1}) \vee \cdots \vee a_k(x_{i_k}).$$

To illustrate this rewriting in a specific case, observe, for example, that

$$(u_1 \wedge u_2 \wedge u_3) \vee (v_1 \wedge v_2 \wedge v_3) = \bigwedge_{i,j=1}^{3} (u_i \vee v_j).$$

But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable! 
What can explain this apparent inconsistency? Observe that the number of new 
variables needed to write a k-term DNF as a k-CNF formula via the transformation
just described is exponential in $k$: it is in $O(n^k)$. The discrepancy comes from the size
of the representation of a concept. A k-term DNF formula can be an exponentially
more compact representation, and efficient PAC-learning is intractable if a time- 
complexity polynomial in that size is required. Thus, this apparent paradox deals 
with key aspects of PAC-learning, which include the cost of the representation of a 
concept and the choice of the hypothesis set. 
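
The rewriting of a k-term DNF as a k-CNF formula can be checked by brute force on a small case. The sketch below is illustrative only: literals are encoded as hypothetical pairs `(i, polarity)` with 0-based indices, and `dnf_to_cnf` forms one clause for every way of picking one literal from each term, which is why the number of clauses grows as the product of the term sizes.

```python
from itertools import product
from typing import FrozenSet, List, Sequence, Tuple

Literal = Tuple[int, bool]  # (variable index, True for x_i, False for its negation)


def dnf_to_cnf(terms: Sequence[Sequence[Literal]]) -> List[FrozenSet[Literal]]:
    """Distribute OR over AND: each clause picks one literal from every term."""
    return [frozenset(choice) for choice in product(*terms)]


def eval_dnf(terms: Sequence[Sequence[Literal]], x: Sequence[bool]) -> bool:
    """A DNF is true when at least one term has all of its literals satisfied."""
    return any(all(x[i] == pol for i, pol in term) for term in terms)


def eval_cnf(clauses: Sequence[FrozenSet[Literal]], x: Sequence[bool]) -> bool:
    """A CNF is true when every clause has at least one satisfied literal."""
    return all(any(x[i] == pol for i, pol in clause) for clause in clauses)


# (x1 AND NOT x2 AND x3) OR (NOT x1 AND x3), with 0-based variable indices.
dnf = [[(0, True), (1, False), (2, True)], [(0, False), (2, True)]]
cnf = dnf_to_cnf(dnf)

# Compare the two formulas on all 2^3 assignments.
equivalent = all(
    eval_dnf(dnf, x) == eval_cnf(cnf, x) for x in product([False, True], repeat=3)
)
```

Here the two terms have 3 and 2 literals, so the resulting CNF has 3 × 2 = 6 clauses, each with at most $k = 2$ literals.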


2.3. Guarantees for finite hypothesis sets — inconsistent case 


In the most general case, there may be no hypothesis in H consistent with the 
labeled training sample. This, in fact, is the typical case in practice, where the 
learning problems may be somewhat difficult or the concept classes more complex 
than the hypothesis set used by the learning algorithm. However, inconsistent 
hypotheses with a small number of errors on the training sample can be useful and, 
as we shall see, can benefit from favorable guarantees under some assumptions. This 
section presents learning guarantees precisely for this inconsistent case and finite 
hypothesis sets. 

To derive learning guarantees in this more general setting, we will use Hoeffding’s 
inequality (theorem D.1) or the following corollary, which relates the generalization 
error and empirical error of a single hypothesis. 
Corollary 2.1
Fix $\epsilon > 0$ and let $S$ denote an i.i.d. sample of size $m$. Then, for any hypothesis
$h \colon X \to \{0, 1\}$, the following inequalities hold:

$$\Pr_{S \sim D^m}\big[\hat{R}(h) - R(h) \ge \epsilon\big] \le \exp(-2m\epsilon^2) \tag{2.14}$$

$$\Pr_{S \sim D^m}\big[\hat{R}(h) - R(h) \le -\epsilon\big] \le \exp(-2m\epsilon^2). \tag{2.15}$$

By the union bound, this implies the following two-sided inequality:

$$\Pr_{S \sim D^m}\big[|\hat{R}(h) - R(h)| \ge \epsilon\big] \le 2\exp(-2m\epsilon^2). \tag{2.16}$$

Proof The result follows immediately from theorem D.1. ∎


Setting the right-hand side of (2.16) to be equal to $\delta$ and solving for $\epsilon$ yields
immediately the following bound for a single hypothesis.


Corollary 2.2 Generalization bound — single hypothesis 
Fix a hypothesis $h \colon X \to \{0, 1\}$. Then, for any $\delta > 0$, the following inequality holds
with probability at least $1 - \delta$:

$$R(h) \le \hat{R}(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{2.17}$$


The following example illustrates this corollary in a simple case. 


Example 2.6 Tossing a coin 

Imagine tossing a biased coin that lands heads with probability $p$, and let our
hypothesis be the one that always guesses tails. Then the true error rate is $R(h) = p$
and the empirical error rate $\hat{R}(h) = \hat{p}$, where $\hat{p}$ is the empirical probability of
heads based on the training sample drawn i.i.d. Thus, corollary 2.2 guarantees with
probability at least $1 - \delta$ that

$$|p - \hat{p}| \le \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{2.18}$$

Therefore, if we choose $\delta = 0.02$ and use a sample of size 500, with probability at
least 98%, the following approximation quality is guaranteed for $\hat{p}$:

$$|p - \hat{p}| \le \sqrt{\frac{\log(100)}{1000}} \approx 0.068. \tag{2.19}$$
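
The guarantee for the coin can also be examined by simulation. The following sketch is illustrative only: the function names, the seed, and the parameter values ($p = 0.3$, $m = 1000$, $\delta = 0.05$, 200 repetitions) are arbitrary assumptions. It computes the deviation radius $\sqrt{\log(2/\delta)/(2m)}$ and the fraction of simulated samples for which the empirical frequency of heads falls within that radius of $p$.

```python
import math
import random


def hoeffding_radius(m: int, delta: float) -> float:
    """Deviation bound sqrt(log(2 / delta) / (2 m)) holding with probability >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))


def coverage(p: float, m: int, delta: float, trials: int, seed: int = 0) -> float:
    """Fraction of simulated samples whose empirical frequency is within the radius of p."""
    rng = random.Random(seed)
    radius = hoeffding_radius(m, delta)
    hits = 0
    for _ in range(trials):
        # Empirical probability of heads over m independent tosses.
        p_hat = sum(rng.random() < p for _ in range(m)) / m
        hits += abs(p - p_hat) <= radius
    return hits / trials


radius = hoeffding_radius(m=1000, delta=0.05)
observed = coverage(p=0.3, m=1000, delta=0.05, trials=200)
```

Since the bound holds for every value of $p$, it is typically loose for a particular coin: the observed coverage is usually well above $1 - \delta$.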


Can we readily apply corollary 2.2 to bound the generalization error of the
hypothesis $h_S$ returned by a learning algorithm when training on a sample $S$? No,
since $h_S$ is not a fixed hypothesis, but a random variable depending on the training
sample $S$ drawn. Note also that unlike the case of a fixed hypothesis for which
the expectation of the empirical error is the generalization error (equation 2.3), the
generalization error $R(h_S)$ is a random variable and in general distinct from the
expectation $\operatorname{E}[\hat{R}(h_S)]$, which is a constant.

Thus, as in the proof for the consistent case, we need to derive a uniform
convergence bound, that is a bound that holds with high probability for all hypotheses
$h \in H$.


Theorem 2.2 Learning bound — finite H, inconsistent case 
Let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least
$1 - \delta$, the following inequality holds:

$$\forall h \in H, \quad R(h) \le \hat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}. \tag{2.20}$$


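Proof Let $h_1, \ldots, h_{|H|}$ be the elements of $H$. Using the union bound and applying
corollary 2.1 to each hypothesis yields:

$$\begin{aligned}
&\Pr\big[\exists h \in H \colon |\hat{R}(h) - R(h)| > \epsilon\big] \\
&\quad = \Pr\big[(|\hat{R}(h_1) - R(h_1)| > \epsilon) \vee \cdots \vee (|\hat{R}(h_{|H|}) - R(h_{|H|})| > \epsilon)\big] \\
&\quad \le \sum_{h \in H} \Pr\big[|\hat{R}(h) - R(h)| > \epsilon\big] \\
&\quad \le 2|H|\exp(-2m\epsilon^2).
\end{aligned}$$

Setting the right-hand side to be equal to $\delta$ completes the proof. ∎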


Thus, for a finite hypothesis set $H$,

$$R(h) \le \hat{R}(h) + O\Bigg(\sqrt{\frac{\log_2|H|}{m}}\Bigg).$$


As already pointed out, $\log_2|H|$ can be interpreted as the number of bits needed
to represent $H$. Several other remarks similar to those made on the generalization
bound in the consistent case can be made here: a larger sample size $m$ guarantees
better generalization, and the bound increases with $|H|$, but only logarithmically.
But, here, the bound is a less favorable function of $\frac{\log_2|H|}{m}$; it varies as the square
root of this term. This is not a minor price to pay: for a fixed $|H|$, to attain the
same guarantee as in the consistent case, a quadratically larger labeled sample is
needed.
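
The cost of inconsistency can be made concrete by comparing the sample sizes implied by the two bounds. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the values ($|H| = 3^{10}$, $\epsilon = 0.1$, $\delta = 0.02$) are chosen only for the comparison. It solves each bound for the smallest $m$ that makes the complexity term at most $\epsilon$.

```python
import math


def consistent_sample_size(log_card_h: float, eps: float, delta: float) -> int:
    """Smallest m with (log|H| + log(1/delta)) / m <= eps (consistent case)."""
    return math.ceil((log_card_h + math.log(1.0 / delta)) / eps)


def inconsistent_sample_size(log_card_h: float, eps: float, delta: float) -> int:
    """Smallest m with sqrt((log|H| + log(2/delta)) / (2 m)) <= eps (inconsistent case)."""
    return math.ceil((log_card_h + math.log(2.0 / delta)) / (2.0 * eps ** 2))


def inconsistent_gap(log_card_h: float, m: int, delta: float) -> float:
    """Complexity term of the inconsistent-case bound for a sample of size m."""
    return math.sqrt((log_card_h + math.log(2.0 / delta)) / (2.0 * m))


log_h = 10 * math.log(3)  # log|H| for |H| = 3^10
m_consistent = consistent_sample_size(log_h, eps=0.1, delta=0.02)
m_inconsistent = inconsistent_sample_size(log_h, eps=0.1, delta=0.02)
```

With these values the consistent-case bound asks for 149 examples and the inconsistent-case bound for 780; the ratio grows like $1/\epsilon$ as $\epsilon$ decreases.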

Note that the bound suggests seeking a trade-off between reducing the empirical 
error versus controlling the size of the hypothesis set: a larger hypothesis set is 
penalized by the second term but could help reduce the empirical error, that is the 
first term. But, for a similar empirical error, it suggests using a smaller hypothesis 
set. This can be viewed as an instance of the so-called Occam’s Razor principle
named after the theologian William of Occam: Plurality should not be posited without 
necessity, also rephrased as, the simplest explanation is best. In this context, it could 
be expressed as follows: All other things being equal, a simpler (smaller) hypothesis 
set is better. 


2.4 Generalities 


In this section we will consider several important questions related to the learning 
scenario, which we left out of the discussion of the earlier sections for simplicity. 


2.4.1 Deterministic versus stochastic scenarios 


In the most general scenario of supervised learning, the distribution $D$ is defined
over $X \times Y$, and the training data is a labeled sample $S$ drawn i.i.d. according to
$D$:

$$S = ((x_1, y_1), \ldots, (x_m, y_m)).$$

The learning problem is to find a hypothesis $h \in H$ with small generalization error

$$R(h) = \Pr_{(x, y) \sim D}[h(x) \neq y] = \operatorname{E}_{(x, y) \sim D}\big[1_{h(x) \neq y}\big].$$

This more general scenario is referred to as the stochastic scenario. Within this 
setting, the output label is a probabilistic function of the input. The stochastic 
scenario captures many real-world problems where the label of an input point is not 
unique. For example, if we seek to predict gender based on input pairs formed by 
the height and weight of a person, then the label will typically not be unique. For 
most pairs, both male and female are possible genders. For each fixed pair, there 
would be a probability distribution of the label being male. 

The natural extension of the PAC-learning framework to this setting is known as
agnostic PAC-learning.


Definition 2.4 Agnostic PAC-learning 

Let $H$ be a hypothesis set. $A$ is an agnostic PAC-learning algorithm if there
exists a polynomial function poly(·, ·, ·, ·) such that for any $\epsilon > 0$ and $\delta > 0$,
for all distributions $D$ over $X \times Y$, the following holds for any sample size
$m \ge$ poly($1/\epsilon$, $1/\delta$, $n$, size($c$)):

$$\Pr_{S \sim D^m}\Big[R(h_S) - \min_{h \in H} R(h) \le \epsilon\Big] \ge 1 - \delta. \tag{2.21}$$

If $A$ further runs in poly($1/\epsilon$, $1/\delta$, $n$, size($c$)), then it is said to be an efficient
agnostic PAC-learning algorithm.


When the label of a point can be uniquely determined by some measurable function
$f \colon X \to Y$ (with probability one), then the scenario is said to be deterministic.
In that case, it suffices to consider a distribution $D$ over the input space. The
training sample is obtained by drawing $(x_1, \ldots, x_m)$ according to $D$ and the labels
are obtained via $f$: $y_i = f(x_i)$ for all $i \in [1, m]$. Many learning problems can be
formulated within this deterministic scenario. 

In the previous sections, as well as in most of the material presented in this book, 
we have restricted our presentation to the deterministic scenario in the interest of 
simplicity. However, for all of this material, the extension to the stochastic scenario 
should be straightforward for the reader. 


2.4.2 Bayes error and noise 


In the deterministic case, by definition, there exists a target function $f$ with no
generalization error: $R(f) = 0$. In the stochastic case, there is a minimal non-zero
error for any hypothesis.


Definition 2.5 Bayes error 
Given a distribution $D$ over $X \times Y$, the Bayes error $R^*$ is defined as the infimum
of the errors achieved by measurable functions $h \colon X \to Y$:

$$R^* = \inf_{h \text{ measurable}} R(h). \tag{2.22}$$


A hypothesis h with R(h) = R* is called a Bayes hypothesis or Bayes classifier. 


By definition, in the deterministic case, we have $R^* = 0$, but, in the stochastic case,
$R^* \neq 0$. Clearly, the Bayes classifier $h_{\text{Bayes}}$ can be defined in terms of the conditional
probabilities as:

$$\forall x \in X, \quad h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0, 1\}} \Pr[y \mid x]. \tag{2.23}$$

The average error made by $h_{\text{Bayes}}$ on $x \in X$ is thus $\min\{\Pr[0 \mid x], \Pr[1 \mid x]\}$, and this
is the minimum possible error. This leads to the following definition of noise.


Definition 2.6 Noise 
Given a distribution $D$ over $X \times Y$, the noise at point $x \in X$ is defined by

$$\text{noise}(x) = \min\{\Pr[1 \mid x], \Pr[0 \mid x]\}. \tag{2.24}$$


The average noise or the noise associated to D is E[noise(x)]. 
Thus, the average noise is precisely the Bayes error: noise $= \operatorname{E}[\text{noise}(x)] = R^*$. The
noise is a characteristic of the learning task indicative of its level of difficulty. A
point $x \in X$, for which noise($x$) is close to 1/2, is sometimes referred to as noisy
and is of course a challenge for accurate prediction.
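
For a distribution over a finite input space, these quantities can be computed explicitly. The sketch below is a toy illustration, not an example from the text: the three-point input space, its marginal probabilities, and the conditional probabilities $\Pr[1 \mid x]$ are made-up values, and the function names are assumptions.

```python
from typing import Callable, Dict


def bayes_label(p1: float) -> int:
    """Bayes classifier at a point: the more probable label given Pr[1 | x] = p1."""
    return 1 if p1 >= 0.5 else 0


def noise(p1: float) -> float:
    """Noise at a point: min{Pr[1 | x], Pr[0 | x]}."""
    return min(p1, 1.0 - p1)


def bayes_error(marginal: Dict[str, float], p1_given_x: Dict[str, float]) -> float:
    """Average noise E[noise(x)], which equals the Bayes error R*."""
    return sum(marginal[x] * noise(p1_given_x[x]) for x in marginal)


def error_of(h: Callable[[str], int], marginal: Dict[str, float],
             p1_given_x: Dict[str, float]) -> float:
    """Generalization error Pr[h(x) != y] of a deterministic hypothesis h."""
    total = 0.0
    for x, px in marginal.items():
        p1 = p1_given_x[x]
        total += px * ((1.0 - p1) if h(x) == 1 else p1)
    return total


# Hypothetical distribution: Pr[x] and Pr[y = 1 | x] for three input points.
marginal = {"a": 0.5, "b": 0.3, "c": 0.2}
p1_given_x = {"a": 0.9, "b": 0.5, "c": 0.0}

r_star = bayes_error(marginal, p1_given_x)
r_bayes = error_of(lambda x: bayes_label(p1_given_x[x]), marginal, p1_given_x)
r_always_one = error_of(lambda x: 1, marginal, p1_given_x)
```

In this toy distribution the point `"b"` is maximally noisy (noise 1/2), while `"c"` is deterministic and contributes nothing to the Bayes error.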


2.4.3 Estimation and approximation errors 


The difference between the error of a hypothesis $h \in H$ and the Bayes error can be
decomposed as:

$$R(h) - R^* = \underbrace{(R(h) - R(h^*))}_{\text{estimation}} + \underbrace{(R(h^*) - R^*)}_{\text{approximation}}, \tag{2.25}$$

where $h^*$ is a hypothesis in $H$ with minimal error, or a best-in-class hypothesis.³


The second term is referred to as the approximation error, since it measures how 
well the Bayes error can be approximated using H. It is a property of the hypothesis 
set H, a measure of its richness. The approximation error is not accessible, since 
in general the underlying distribution D is not known. Even with various noise 
assumptions, estimating the approximation error is difficult. 

The first term is the estimation error, and it depends on the hypothesis h 
selected. It measures the quality of the hypothesis h with respect to the best-in-class 
hypothesis. The definition of agnostic PAC-learning is also based on the estimation 
error. The estimation error of an algorithm A, that is, the estimation error of the 
hypothesis hg returned after training on a sample S, can sometimes be bounded in 
terms of the generalization error. 

For example, let $h_S^{\mathrm{ERM}}$ denote the hypothesis returned by the empirical risk
minimization algorithm, that is the algorithm that returns a hypothesis $h_S^{\mathrm{ERM}}$ with
the smallest empirical error. Then, the generalization bound given by theorem 2.2,
or any other bound on $\sup_{h \in H} |R(h) - \hat{R}(h)|$, can be used to bound the estimation
error of the empirical risk minimization algorithm. Indeed, rewriting the estimation
error to make $\hat{R}(h_S^{\mathrm{ERM}})$ appear and using $\hat{R}(h_S^{\mathrm{ERM}}) \le \hat{R}(h^*)$, which holds by the
definition of the algorithm, we can write

$$\begin{aligned}
R(h_S^{\mathrm{ERM}}) - R(h^*) &= R(h_S^{\mathrm{ERM}}) - \hat{R}(h_S^{\mathrm{ERM}}) + \hat{R}(h_S^{\mathrm{ERM}}) - R(h^*) \\
&\le R(h_S^{\mathrm{ERM}}) - \hat{R}(h_S^{\mathrm{ERM}}) + \hat{R}(h^*) - R(h^*) \\
&\le 2 \sup_{h \in H} |R(h) - \hat{R}(h)|.
\end{aligned} \tag{2.26}$$


3. When $H$ is a finite hypothesis set, $h^*$ necessarily exists; otherwise, in this discussion
$R(h^*)$ can be replaced by $\inf_{h \in H} R(h)$.
Figure 2.5 Illustration of structural risk minimization. The plots of three errors
(training error, complexity term, and bound on the generalization error) are shown
as a function of a measure of capacity. Clearly, as the size or capacity of the
hypothesis set increases, the training error decreases, while the complexity term
increases. SRM selects the hypothesis minimizing a bound on the generalization
error (shown in red), which is the sum of the empirical error and the complexity term.


The right-hand side of (2.26) can be bounded by theorem 2.2 and increases with 
the size of the hypothesis set, while R(h*) decreases with |H|. 


2.4.4 Model selection 


Here, we discuss some broad model selection and algorithmic ideas based on the 
theoretical results presented in the previous sections. We assume an i.i.d. labeled 
training sample $S$ of size $m$ and denote the error of a hypothesis $h$ on $S$ by $\hat{R}_S(h)$
to explicitly indicate its dependency on $S$.

While the guarantee of theorem 2.2 holds only for finite hypothesis sets, it already 
provides us with some useful insights for the design of algorithms and, as we will see 
in the next chapters, similar guarantees hold in the case of infinite hypothesis sets. 
Such results invite us to consider two terms: the empirical error and a complexity 
term, which here is a function of |H| and the sample size m. 

In view of that, the ERM algorithm, which only seeks to minimize the error on
the training sample

$$h_S^{\mathrm{ERM}} = \operatorname*{argmin}_{h \in H} \hat{R}_S(h), \tag{2.27}$$

might not be successful, since it disregards the complexity term. In fact, the
performance of the ERM algorithm is typically very poor in practice. Additionally, 
in many cases, determining the ERM solution is computationally intractable. For 
example, finding a linear hypothesis with the smallest error on the training sample 
is NP-hard (as a function of the dimension of the space). 
Another method known as structural risk minimization (SRM) consists of
considering instead an infinite sequence of hypothesis sets with increasing sizes

$$H_0 \subset H_1 \subset \cdots \subset H_n \subset \cdots \tag{2.28}$$

and finding the ERM solution $h_n^{\mathrm{ERM}}$ for each $H_n$. The hypothesis selected is the
one among the $h_n^{\mathrm{ERM}}$ solutions with the smallest sum of the empirical error and
a complexity term complexity($H_n$, $m$) that depends on the size (or more generally
the capacity, that is, another measure of the richness of $H$) of $H_n$, and the sample
size $m$:

$$h_S^{\mathrm{SRM}} = \operatorname*{argmin}_{h \in H_n,\ n \in \mathbb{N}} \hat{R}_S(h) + \text{complexity}(H_n, m). \tag{2.29}$$


Figure 2.5 illustrates the SRM method. While SRM benefits from strong theoretical 
guarantees, it is typically computationally very expensive, since it requires deter- 
mining the solution of multiple ERM problems. Note that the number of ERM 
problems is not infinite if for some n the minimum empirical error is zero: The 
objective function can only be larger for n’ > n. 
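
The SRM procedure can be sketched on a toy problem. Everything specific in the example below is an assumption made for illustration: the nested sets $H_n$ are threshold classifiers $x \mapsto 1_{x \ge t}$ with $t$ on the dyadic grid $\{j/2^n\}$, the complexity term borrows the form $\sqrt{(\log|H_n| + \log(2/\delta))/(2m)}$ from theorem 2.2, and each ERM problem is solved by exhaustive search.

```python
import math
from typing import List, Tuple

Sample = List[Tuple[float, int]]


def empirical_error(t: float, sample: Sample) -> float:
    """Fraction of sample points misclassified by the threshold rule x >= t."""
    return sum(int(x >= t) != y for x, y in sample) / len(sample)


def srm(sample: Sample, max_n: int, delta: float = 0.05) -> Tuple[float, int, float]:
    """Return (objective value, index n, threshold t) minimizing error plus complexity."""
    m = len(sample)
    best = None
    for n in range(max_n + 1):
        # H_n: thresholds on the grid {0, 1/2^n, ..., 1}; the grids are nested in n.
        grid = [j / 2 ** n for j in range(2 ** n + 1)]
        t = min(grid, key=lambda thr: empirical_error(thr, sample))  # ERM within H_n
        err = empirical_error(t, sample)
        penalty = math.sqrt((math.log(len(grid)) + math.log(2.0 / delta)) / (2.0 * m))
        if best is None or err + penalty < best[0]:
            best = (err + penalty, n, t)
        if err == 0.0:
            break  # larger n can only increase the objective from here on
    return best


# Ten points labeled by the threshold 0.5.
xs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
sample = [(x, int(x >= 0.5)) for x in xs]
score, n_selected, t_selected = srm(sample, max_n=4)
```

The early exit mirrors the remark above: once some $H_n$ reaches zero empirical error, the empirical term cannot improve and the complexity term only grows with $n$.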

An alternative family of algorithms is based on a more straightforward optimiza- 
tion that consists of minimizing the sum of the empirical error and a regularization 
term that penalizes more complex hypotheses. The regularization term is typically 
defined as $\|h\|^2$ for some norm $\|\cdot\|$ when $H$ is a vector space:

$$h_S^{\mathrm{REG}} = \operatorname*{argmin}_{h \in H} \hat{R}_S(h) + \lambda\|h\|^2. \tag{2.30}$$

$\lambda \ge 0$ is a regularization parameter, which can be used to determine the trade-off
between empirical error minimization and control of the complexity. In practice,
$\lambda$ is typically selected using n-fold cross-validation. In the next chapters, we will see
a number of different instances of such regularization-based algorithms. 
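
To make the role of the regularization parameter concrete, the sketch below works through a deliberately simple instance. It is an assumption-laden illustration rather than an algorithm from the text: the hypotheses are one-dimensional linear functions $x \mapsto wx$, the empirical error is replaced by the mean squared error (a swapped-in loss chosen because the minimizer has a closed form), and the norm is $|w|$.

```python
from typing import List


def regularized_fit(xs: List[float], ys: List[float], lam: float) -> float:
    """Minimize (1/m) * sum((w * x_i - y_i)^2) + lam * w^2 over real w.

    Setting the derivative to zero gives w = sum(x_i * y_i) / (sum(x_i^2) + lam * m).
    """
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam * m)


def objective(w: float, xs: List[float], ys: List[float], lam: float) -> float:
    """Value of the regularized objective at w."""
    m = len(xs)
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / m + lam * w * w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_unregularized = regularized_fit(xs, ys, lam=0.0)  # fits the data exactly
w_regularized = regularized_fit(xs, ys, lam=1.0)    # shrunk toward zero
```

Increasing `lam` trades a larger training error for a smaller norm, which is the trade-off the parameter controls.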


2.5 Chapter notes 


The PAC learning framework was introduced by Valiant [1984]. The book of Kearns 
and Vazirani [1994] is an excellent reference dealing with most aspects of PAC- 
learning and several other foundational questions in machine learning. Our example 
of learning axis-aligned rectangles is based on that reference. 

The PAC learning framework is a computational framework since it takes into 
account the cost of the computational representations and the time complexity of 
the learning algorithm. If we omit the computational aspects, it is similar to the 
learning framework considered earlier by Vapnik and Chervonenkis [see Vapnik,
2000].

Occam’s razor principle is invoked in a variety of contexts, such as in linguistics to
justify the superiority of a set of rules or syntax. The Kolmogorov complexity can be 
viewed as the corresponding framework in information theory. In the context of the 
learning guarantees presented in this chapter, the principle suggests selecting the 
most parsimonious explanation (the hypothesis set with the smallest cardinality). 
We will see in the next sections other applications of this principle with different 
notions of simplicity or complexity. The idea of structural risk minimization (SRM) 
is due to Vapnik [1998].


2.6 Exercises 


2.1 Two-oracle variant of the PAC model. Assume that positive and negative
examples are now drawn from two separate distributions $D_+$ and $D_-$. For an
accuracy $(1 - \epsilon)$, the learning algorithm must find a hypothesis $h$ such that:

$$\Pr_{x \sim D_+}[h(x) = 0] \leq \epsilon \quad \text{and} \quad \Pr_{x \sim D_-}[h(x) = 1] \leq \epsilon. \tag{2.31}$$

Thus, the hypothesis must have a small error on both distributions. Let $C$ be any
concept class and $H$ be any hypothesis space. Let $h_0$ and $h_1$ represent the identically
0 and identically 1 functions, respectively. Prove that $C$ is efficiently PAC-learnable
using $H$ in the standard (one-oracle) PAC model if and only if it is efficiently PAC-
learnable using $H \cup \{h_0, h_1\}$ in this two-oracle PAC model.


2.2 PAC learning of hyper-rectangles. An axis-aligned hyper-rectangle in $\mathbb{R}^n$ is a
set of the form $[a_1, b_1] \times \cdots \times [a_n, b_n]$. Show that axis-aligned hyper-rectangles are
PAC-learnable by extending the proof given in Example 2.1 for the case $n = 2$.


2.3 Concentric circles. Let $\mathcal{X} = \mathbb{R}^2$ and consider the set of concepts of the form
$c = \{(x, y) : x^2 + y^2 \leq r^2\}$ for some real number $r$. Show that this class can be
$(\epsilon, \delta)$-PAC-learned from training data of size $m \geq (1/\epsilon)\log(1/\delta)$.


2.4 Non-concentric circles. Let $\mathcal{X} = \mathbb{R}^2$ and consider the set of concepts of the form
$c = \{x \in \mathbb{R}^2 : \|x - x_0\| \leq r\}$ for some point $x_0 \in \mathbb{R}^2$ and real number $r$. Gertrude, an
aspiring machine learning researcher, attempts to show that this class of concepts
may be $(\epsilon, \delta)$-PAC-learned with sample complexity $m \geq (3/\epsilon)\log(3/\delta)$, but she is
having trouble with her proof. Her idea is that the learning algorithm would select
the smallest circle consistent with the training data. She has drawn three regions
$r_1, r_2, r_3$ around the edge of concept $c$, with each region having probability $\epsilon/3$ (see
figure 2.6). She wants to argue that if the generalization error is greater than or
equal to $\epsilon$, then one of these regions must have been missed by the training data,
and hence this event will occur with probability at most $\delta$. Can you tell Gertrude
if her approach works?

Figure 2.6 Gertrude’s regions $r_1, r_2, r_3$.


2.5 Triangles. Let $\mathcal{X} = \mathbb{R}^2$ with orthonormal basis $(e_1, e_2)$, and consider the set of
concepts defined by the area inside a right triangle $ABC$ with two sides parallel to
the axes, with $\vec{AB}/\|\vec{AB}\| = e_1$ and $\vec{AC}/\|\vec{AC}\| = e_2$, and $\|\vec{AB}\|/\|\vec{AC}\| = \alpha$ for some
positive real $\alpha \in \mathbb{R}_+$. Show, using similar methods to those used in the chapter for
the axis-aligned rectangles, that this class can be $(\epsilon, \delta)$-PAC-learned from training
data of size $m \geq (3/\epsilon)\log(3/\delta)$.


2.6 Learning in the presence of noise — rectangles. In example 2.1, we showed that 
the concept class of axis-aligned rectangles is PAC-learnable. Consider now the case 
where the training points received by the learner are subject to the following noise: 
points negatively labeled are unaffected by noise but the label of a positive training 
point is randomly flipped to negative with probability $\eta \in (0, \frac{1}{2})$. The exact value of
the noise rate $\eta$ is not known to the learner but an upper bound $\eta'$ is supplied to him
with $\eta \leq \eta' < 1/2$. Show that the algorithm described in class returning the tightest
rectangle containing positive points can still PAC-learn axis-aligned rectangles in
the presence of this noise. To do so, you can proceed using the following steps:


(a) Using the same notation as in example 2.1, assume that $\Pr[R] > \epsilon$. Suppose
that $R(R') > \epsilon$. Give an upper bound on the probability that $R'$ misses a region
$r_j$, $j \in [1, 4]$, in terms of $\epsilon$ and $\eta'$.

(b) Use that to give an upper bound on $\Pr[R(R') > \epsilon]$ in terms of $\epsilon$ and $\eta'$ and
conclude by giving a sample complexity bound.


2.7 Learning in the presence of noise — general case. In this question, we will seek 
a result that is more general than in the previous question. We consider a finite 
hypothesis set $H$, assume that the target concept is in $H$, and adopt the following
noise model: the label of a training point received by the learner is randomly changed
with probability $\eta \in (0, \frac{1}{2})$. The exact value of the noise rate $\eta$ is not known to the
learner but an upper bound $\eta'$ is supplied to him with $\eta \leq \eta' < 1/2$.


(a) For any $h \in H$, let $d(h)$ denote the probability that the label of a training
point received by the learner disagrees with the one given by $h$. Let $h^*$ be the
target hypothesis; show that $d(h^*) = \eta$.

(b) More generally, show that for any $h \in H$, $d(h) = \eta + (1 - 2\eta) R(h)$, where
$R(h)$ denotes the generalization error of $h$.

(c) Fix $\epsilon > 0$ for this and all the following questions. Use the previous questions
to show that if $R(h) > \epsilon$, then $d(h) - d(h^*) \geq \epsilon'$, where $\epsilon' = \epsilon(1 - 2\eta')$.

(d) For any hypothesis $h \in H$ and sample $S$ of size $m$, let $\widehat{d}(h)$ denote the
fraction of the points in $S$ whose labels disagree with those given by $h$. We will
consider the algorithm $L$ which, after receiving $S$, returns the hypothesis $h_S$
with the smallest number of disagreements (thus $\widehat{d}(h_S)$ is minimal). To show
PAC-learning for $L$, we will show that for any $h$, if $R(h) > \epsilon$, then with high
probability $\widehat{d}(h) \geq \widehat{d}(h^*)$. First, show that for any $\delta > 0$, with probability at
least $1 - \delta/2$, for $m \geq \frac{2}{\epsilon'^2}\log\frac{2}{\delta}$, the following holds:

$$\widehat{d}(h^*) - d(h^*) \leq \epsilon'/2.$$

(e) Second, show that for any $\delta > 0$, with probability at least $1 - \delta/2$, for
$m \geq \frac{2}{\epsilon'^2}\left(\log|H| + \log\frac{2}{\delta}\right)$, the following holds for all $h \in H$:

$$d(h) - \widehat{d}(h) \leq \epsilon'/2.$$

(f) Finally, show that for any $\delta > 0$, with probability at least $1 - \delta$, for
$m \geq \frac{2}{\epsilon^2(1 - 2\eta')^2}\left(\log|H| + \log\frac{2}{\delta}\right)$, the following holds for all $h \in H$ with $R(h) > \epsilon$:

$$\widehat{d}(h) - \widehat{d}(h^*) \geq 0.$$

(Hint: use $\widehat{d}(h) - \widehat{d}(h^*) = [\widehat{d}(h) - d(h)] + [d(h) - d(h^*)] + [d(h^*) - \widehat{d}(h^*)]$ and
use previous questions to lower bound each of these three terms).


2.8 Learning union of intervals. Let $[a, b]$ and $[c, d]$ be two intervals of the real line
with $a < b < c < d$. Let $\epsilon > 0$, and assume that $\Pr_D((b, c)) > \epsilon$, where $D$ is the
distribution according to which points are drawn.


(a) Show that the probability that $m$ points are drawn i.i.d. without any of
them falling in the interval $(b, c)$ is at most $e^{-m\epsilon}$.


(b) Show that the concept class formed by the union of two closed intervals
in $\mathbb{R}$, e.g., $[a, b] \cup [c, d]$, is PAC-learnable by giving a proof similar to the one
given in Example 2.1 for axis-aligned rectangles. (Hint: your algorithm might
not return a hypothesis consistent with future negative points in this case.)


2.9 Consistent hypotheses. In this chapter, we showed that for a finite hypothesis 
set H, a consistent learning algorithm A is a PAC-learning algorithm. Here, we 
consider a converse question. Let Z be a finite set of m labeled points. Suppose that 
you are given a PAC-learning algorithm A. Show that you can use A and a finite 
training sample S to find in polynomial time a hypothesis h € H that is consistent 
with Z, with high probability. (Hint: you can select an appropriate distribution D 
over Z and give a condition on R(h) for h to be consistent.) 


2.10 Senate laws. For important questions, President Mouth relies on expert advice. 
He selects an appropriate advisor from a collection of H = 2,800 experts. 


(a) Assume that laws are proposed in a random fashion independently and 
identically according to some distribution D determined by an unknown group 
of senators. Assume that President Mouth can find and select an expert senator 
out of H who has consistently voted with the majority for the last m = 200 
laws. Give a bound on the probability that such a senator incorrectly predicts 
the global vote for a future law. What is the value of the bound with 95% 
confidence? 

(b) Assume now that President Mouth can find and select an expert senator
out of H who has consistently voted with the majority for all but m’ = 20 of
the last m = 200 laws. What is the value of the new bound?


3 Rademacher Complexity and VC-Dimension


The hypothesis sets typically used in machine learning are infinite. But the sample 
complexity bounds of the previous chapter are uninformative when dealing with 
infinite hypothesis sets. One could ask whether efficient learning from a finite sample 
is even possible when the hypothesis set H is infinite. Our analysis of the family of 
axis-aligned rectangles (Example 2.1) indicates that this is indeed possible at least 
in some cases, since we proved that that infinite concept class was PAC-learnable. 
Our goal in this chapter will be to generalize that result and derive general learning 
guarantees for infinite hypothesis sets. 

A general idea for doing so consists of reducing the infinite case to the analysis
of finite sets of hypotheses and then proceeding as in the previous chapter. There
are different techniques for that reduction, each relying on a different notion of 
complexity for the family of hypotheses. The first complexity notion we will use 
is that of Rademacher complexity. This will help us derive learning guarantees 
using relatively simple proofs based on McDiarmid’s inequality, while obtaining 
high-quality bounds, including data-dependent ones, which we will frequently make 
use of in future chapters. However, the computation of the empirical Rademacher 
complexity is NP-hard for some hypothesis sets. Thus, we subsequently introduce 
two other purely combinatorial notions, the growth function and the VC-dimension. 
We first relate the Rademacher complexity to the growth function and then bound 
the growth function in terms of the VC-dimension. The VC-dimension is often easier 
to bound or estimate. We will review a series of examples showing how to compute 
or bound it, then relate the growth function and the VC-dimension. This leads to
generalization bounds based on the VC-dimension. Finally, we present lower bounds 
based on the VC-dimension both in the realizable and non-realizable cases, which 
will demonstrate the critical role of this notion in learning. 




3.1 Rademacher complexity


We will continue to use $H$ to denote a hypothesis set as in the previous chapters,
and $h$ an element of $H$. Many of the results of this section are general and hold for
an arbitrary loss function $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. To each $h: \mathcal{X} \to \mathcal{Y}$, we can associate a
function $g$ that maps $(x, y) \in \mathcal{X} \times \mathcal{Y}$ to $L(h(x), y)$ without explicitly describing the
specific loss $L$ used. In what follows $G$ will generally be interpreted as the family of
loss functions associated to $H$.

The Rademacher complexity captures the richness of a family of functions by 
measuring the degree to which a hypothesis set can fit random noise. The following 
states the formal definitions of the empirical and average Rademacher complexity. 


Definition 3.1 Empirical Rademacher complexity 

Let $G$ be a family of functions mapping from $Z$ to $[a, b]$ and $S = (z_1, \ldots, z_m)$ a fixed
sample of size $m$ with elements in $Z$. Then, the empirical Rademacher complexity
of $G$ with respect to the sample $S$ is defined as:

$$\widehat{\mathfrak{R}}_S(G) = \mathop{\mathbb{E}}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right], \tag{3.1}$$

where $\sigma = (\sigma_1, \ldots, \sigma_m)^\top$, with $\sigma_i$s independent uniform random variables taking
values in $\{-1, +1\}$.¹ The random variables $\sigma_i$ are called Rademacher variables.


Let $\mathbf{g}_S$ denote the vector of values taken by function $g$ over the sample $S$: $\mathbf{g}_S =
(g(z_1), \ldots, g(z_m))^\top$. Then, the empirical Rademacher complexity can be rewritten
as

$$\widehat{\mathfrak{R}}_S(G) = \mathop{\mathbb{E}}_{\sigma}\left[\sup_{g \in G} \frac{\sigma \cdot \mathbf{g}_S}{m}\right].$$

The inner product $\sigma \cdot \mathbf{g}_S$ measures the correlation of $\mathbf{g}_S$ with the vector of random
noise $\sigma$. The supremum $\sup_{g \in G} \frac{\sigma \cdot \mathbf{g}_S}{m}$ is a measure of how well the function class $G$
correlates with $\sigma$ over the sample $S$. Thus, the empirical Rademacher complexity
measures on average how well the function class $G$ correlates with random noise
on $S$. This describes the richness of the family $G$: richer or more complex families
$G$ can generate more vectors $\mathbf{g}_S$ and thus better correlate with random noise, on
average.
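
For a finite family and a small sample, the expectation in definition 3.1 can be computed exactly by enumerating all $2^m$ sign vectors. The sketch below assumes each function is given directly by its vector of values on $S$; the name `empirical_rademacher` and the two toy families are hypothetical.

```python
from itertools import product

def empirical_rademacher(value_vectors):
    """Exact empirical Rademacher complexity of a finite function family.

    value_vectors: list of tuples (g(z_1), ..., g(z_m)), one per function g.
    Enumerates all 2^m sign vectors, so it is only practical for small m.
    """
    m = len(value_vectors[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        # Supremum over the family of the correlation with sigma.
        total += max(sum(s * v for s, v in zip(sigma, g)) for g in value_vectors) / m
    return total / 2 ** m

# A single function cannot adapt to the random signs.
single = [(1, 0, 1, 1)]
# All 0/1-valued vectors on three points: every sign pattern can be matched.
rich = list(product((0, 1), repeat=3))

r_single = empirical_rademacher(single)
r_rich = empirical_rademacher(rich)
```

Here `r_single` is 0, while `r_rich` equals 1/2: the richer family can always select exactly the points with positive signs.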


1. We assume implicitly that the supremum over the family G in this definition is 
measurable and in general will adopt the same assumption throughout this book for other 
suprema over a class of functions. This assumption does not hold for arbitrary function 
classes but it is valid for the hypothesis sets typically considered in practice in machine
learning, and the instances discussed in this book.

Definition 3.2 Rademacher complexity

Let $D$ denote the distribution according to which samples are drawn. For any
integer $m \geq 1$, the Rademacher complexity of $G$ is the expectation of the empirical
Rademacher complexity over all samples of size $m$ drawn according to $D$:

$$\mathfrak{R}_m(G) = \mathop{\mathbb{E}}_{S \sim D^m}\left[\widehat{\mathfrak{R}}_S(G)\right]. \tag{3.2}$$


We are now ready to present our first generalization bounds based on Rademacher 
complexity. 


Theorem 3.1 
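Let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, each of the following holds for all $g \in G$:

$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \tag{3.3}$$

$$\text{and} \quad \mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{3.4}$$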


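Proof For any sample $S = (z_1, \ldots, z_m)$ and any $g \in G$, we denote by $\widehat{\mathbb{E}}_S[g]$ the
empirical average of $g$ over $S$: $\widehat{\mathbb{E}}_S[g] = \frac{1}{m}\sum_{i=1}^m g(z_i)$. The proof consists of applying
McDiarmid’s inequality to function $\Phi$ defined for any sample $S$ by

$$\Phi(S) = \sup_{g \in G} \mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]. \tag{3.5}$$

Let $S$ and $S'$ be two samples differing by exactly one point, say $z_m$ in $S$ and $z'_m$
in $S'$. Then, since the difference of suprema does not exceed the supremum of the
difference, we have

$$\Phi(S') - \Phi(S) \leq \sup_{g \in G} \widehat{\mathbb{E}}_S[g] - \widehat{\mathbb{E}}_{S'}[g] = \sup_{g \in G} \frac{g(z_m) - g(z'_m)}{m} \leq \frac{1}{m}. \tag{3.6}$$

Similarly, we can obtain $\Phi(S) - \Phi(S') \leq 1/m$, thus $|\Phi(S) - \Phi(S')| \leq 1/m$. Then,
by McDiarmid’s inequality, for any $\delta > 0$, with probability at least $1 - \delta/2$, the
following holds:

$$\Phi(S) \leq \mathop{\mathbb{E}}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{3.7}$$

We next bound the expectation of the right-hand side as follows:

\begin{align}
\mathop{\mathbb{E}}_S[\Phi(S)] &= \mathop{\mathbb{E}}_S\Big[\sup_{g \in G} \mathbb{E}[g] - \widehat{\mathbb{E}}_S(g)\Big] \notag \\
&= \mathop{\mathbb{E}}_S\Big[\sup_{g \in G} \mathop{\mathbb{E}}_{S'}\big[\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big]\Big] \tag{3.8} \\
&\leq \mathop{\mathbb{E}}_{S, S'}\Big[\sup_{g \in G} \widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\Big] \tag{3.9} \\
&= \mathop{\mathbb{E}}_{S, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \big(g(z'_i) - g(z_i)\big)\Big] \tag{3.10} \\
&= \mathop{\mathbb{E}}_{\sigma, S, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i \big(g(z'_i) - g(z_i)\big)\Big] \tag{3.11} \\
&\leq \mathop{\mathbb{E}}_{\sigma, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z'_i)\Big] + \mathop{\mathbb{E}}_{\sigma, S}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m -\sigma_i g(z_i)\Big] \tag{3.12} \\
&= 2 \mathop{\mathbb{E}}_{\sigma, S}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] = 2\mathfrak{R}_m(G). \tag{3.13}
\end{align}

Equation 3.8 uses the fact that points in $S'$ are sampled in an i.i.d. fashion and thus
$\mathbb{E}[g] = \mathbb{E}_{S'}[\widehat{\mathbb{E}}_{S'}(g)]$, as in (2.3). Inequality 3.9 holds by Jensen’s inequality and the
convexity of the supremum function. In equation 3.11, we introduce Rademacher
variables $\sigma_i$s, that is uniformly distributed independent random variables taking
values in $\{-1, +1\}$ as in definition 3.1. This does not change the expectation
appearing in (3.10): when $\sigma_i = 1$, the associated summand remains unchanged;
when $\sigma_i = -1$, the associated summand flips signs, which is equivalent to swapping
$z_i$ and $z'_i$ between $S$ and $S'$. Since we are taking the expectation over all possible $S$
and $S'$, this swap does not affect the overall expectation. We are simply changing the
order of the summands within the expectation. (3.12) holds by the sub-additivity of
the supremum function, that is the inequality $\sup(U + V) \leq \sup(U) + \sup(V)$. Finally,
(3.13) stems from the definition of Rademacher complexity and the fact that the
variables $\sigma_i$ and $-\sigma_i$ are distributed in the same way.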

The reduction to $\mathfrak{R}_m(G)$ in equation 3.13 yields the bound in equation 3.3,
using $\delta$ instead of $\delta/2$. To derive a bound in terms of $\widehat{\mathfrak{R}}_S(G)$, we observe that,
by definition 3.1, changing one point in $S$ changes $\widehat{\mathfrak{R}}_S(G)$ by at most $1/m$. Then,
using again McDiarmid’s inequality, with probability $1 - \delta/2$ the following holds:

$$\mathfrak{R}_m(G) \leq \widehat{\mathfrak{R}}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{3.14}$$

Finally, we use the union bound to combine inequalities 3.7 and 3.14, which yields
with probability at least $1 - \delta$:

$$\Phi(S) \leq 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \tag{3.15}$$

which matches (3.4). ∎
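
To see how the terms of (3.3) and (3.4) behave numerically, the right-hand sides can be evaluated directly. The function names and the input values below are hypothetical and chosen only for illustration.

```python
import math

def bound_3_3(emp_mean, rademacher, m, delta):
    """Right-hand side of (3.3), given the Rademacher complexity R_m(G)."""
    return emp_mean + 2 * rademacher + math.sqrt(math.log(1 / delta) / (2 * m))

def bound_3_4(emp_mean, emp_rademacher, m, delta):
    """Right-hand side of (3.4), given the empirical Rademacher complexity."""
    return emp_mean + 2 * emp_rademacher + 3 * math.sqrt(math.log(2 / delta) / (2 * m))

# Illustrative values: empirical average 0.1, complexity 0.02, m = 1000, delta = 0.05.
b33 = bound_3_3(0.1, 0.02, 1000, 0.05)
b34 = bound_3_4(0.1, 0.02, 1000, 0.05)
```

For equal complexity values, the data-dependent bound (3.4) pays a larger confidence term, which is the price of replacing $\mathfrak{R}_m(G)$ by a quantity computable from the sample.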


The following result relates the empirical Rademacher complexities of a hypothesis
set $H$ and of the family of loss functions $G$ associated to $H$ in the case of binary
loss (zero-one loss).


Lemma 3.1 

Let $H$ be a family of functions taking values in $\{-1, +1\}$ and let $G$ be the family of
loss functions associated to $H$ for the zero-one loss: $G = \{(x, y) \mapsto 1_{h(x) \neq y} : h \in H\}$.
For any sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ of elements in $\mathcal{X} \times \{-1, +1\}$, let $S_{\mathcal{X}}$
denote its projection over $\mathcal{X}$: $S_{\mathcal{X}} = (x_1, \ldots, x_m)$. Then, the following relation holds
between the empirical Rademacher complexities of $G$ and $H$:

$$\widehat{\mathfrak{R}}_S(G) = \frac{1}{2}\widehat{\mathfrak{R}}_{S_{\mathcal{X}}}(H). \tag{3.16}$$


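Proof For any sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ of elements in $\mathcal{X} \times \{-1, +1\}$,
by definition, the empirical Rademacher complexity of $G$ can be written as:

\begin{align*}
\widehat{\mathfrak{R}}_S(G) &= \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i 1_{h(x_i) \neq y_i}\Big] \\
&= \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i \frac{1 - y_i h(x_i)}{2}\Big] \\
&= \frac{1}{2}\mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m -\sigma_i y_i h(x_i)\Big] \\
&= \frac{1}{2}\mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big] = \frac{1}{2}\widehat{\mathfrak{R}}_{S_{\mathcal{X}}}(H),
\end{align*}

where we used the fact that $1_{h(x_i) \neq y_i} = (1 - y_i h(x_i))/2$ and the fact that for a fixed
$y_i \in \{-1, +1\}$, $\sigma_i$ and $-y_i\sigma_i$ are distributed in the same way. ∎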
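
The identity (3.16) can be checked exactly on a small example by enumerating all sign vectors. The finite class of threshold classifiers, the sample, and the helper name `rademacher` are assumptions made for this sketch.

```python
from itertools import product

def rademacher(vectors):
    """Exact empirical Rademacher complexity of a finite set of value vectors."""
    m = len(vectors[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * v for s, v in zip(sigma, vec)) for vec in vectors) / m
    return total / 2 ** m

xs = [0.1, 0.4, 0.6, 0.9]
ys = [-1, 1, -1, 1]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

# H: threshold classifiers h_t(x) = +1 if x >= t else -1 (a hypothetical finite H).
h_values = [tuple(1 if x >= t else -1 for x in xs) for t in thresholds]
# G: zero-one losses of the same classifiers on the labeled sample.
g_values = [tuple(int(hx != y) for hx, y in zip(h, ys)) for h in h_values]

lhs = rademacher(g_values)        # complexity of the loss family G on S
rhs = 0.5 * rademacher(h_values)  # half the complexity of H on S_X
```

The two quantities agree up to floating-point rounding, whatever labels `ys` are chosen.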


Note that the lemma implies, by taking expectations, that for any $m \geq 1$, $\mathfrak{R}_m(G) =
\frac{1}{2}\mathfrak{R}_m(H)$. These connections between the empirical and average Rademacher
complexities can be used to derive generalization bounds for binary classification in
terms of the Rademacher complexity of the hypothesis set $H$.


Theorem 3.2 Rademacher complexity bounds — binary classification 
Let $H$ be a family of functions taking values in $\{-1, +1\}$ and let $D$ be the distribution
over the input space $\mathcal{X}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over
a sample $S$ of size $m$ drawn according to $D$, each of the following holds for any
$h \in H$:

$$R(h) \leq \widehat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \tag{3.17}$$

$$\text{and} \quad R(h) \leq \widehat{R}(h) + \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{3.18}$$


Proof The result follows immediately by theorem 3.1 and lemma 3.1. 


The theorem provides two generalization bounds for binary classification based on 
the Rademacher complexity. Note that the second bound, (3.18), is data-dependent: 
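the empirical Rademacher complexity $\widehat{\mathfrak{R}}_S(H)$ is a function of the specific sample
$S$ drawn. Thus, this bound could be particularly informative if we could compute
$\widehat{\mathfrak{R}}_S(H)$. But, how can we compute the empirical Rademacher complexity? Using
again the fact that $\sigma_i$ and $-\sigma_i$ are distributed in the same way, we can write

$$\widehat{\mathfrak{R}}_S(H) = \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m -\sigma_i h(x_i)\Big] = -\mathop{\mathbb{E}}_{\sigma}\Big[\inf_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big].$$

Now, for a fixed value of $\sigma$, computing $\inf_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)$ is equivalent to
an empirical risk minimization problem, which is known to be computationally
hard for some hypothesis sets. Thus, in some cases, computing $\widehat{\mathfrak{R}}_S(H)$ could
be computationally hard. In the next sections, we will relate the Rademacher
complexity to combinatorial measures that are easier to compute.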


3.2 Growth function


Here we will show how the Rademacher complexity can be bounded in terms of the 
growth function. 


Definition 3.3 Growth function 
The growth function $\Pi_H : \mathbb{N} \to \mathbb{N}$ for a hypothesis set $H$ is defined by:

$$\forall m \in \mathbb{N}, \quad \Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq \mathcal{X}} \Big|\big\{(h(x_1), \ldots, h(x_m)) : h \in H\big\}\Big|. \tag{3.19}$$

Thus, $\Pi_H(m)$ is the maximum number of distinct ways in which $m$ points can be
classified using hypotheses in $H$. This provides another measure of the richness of
the hypothesis set $H$. However, unlike the Rademacher complexity, this measure
does not depend on the distribution; it is purely combinatorial.

To relate the Rademacher complexity to the growth function, we will use Massart’s
lemma.
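
To make definition 3.3 concrete before turning to Massart’s lemma, the following sketch counts the dichotomies that closed intervals of the real line induce on a given sample (a point is labeled 1 when it falls inside the interval). The function name is hypothetical. Restricting the interval endpoints to sample points, plus one value below the minimum, is enough to reach every labeling an interval can realize.

```python
def interval_dichotomies(points):
    """Set of labelings of `points` realized by closed intervals [a, b]."""
    pts = list(points)
    # Endpoints at sample points suffice; the extra value below the minimum
    # yields the labeling in which no point is covered.
    candidates = pts + [min(pts) - 1]
    labelings = set()
    for a in candidates:
        for b in candidates:
            if a <= b:
                labelings.add(tuple(int(a <= x <= b) for x in pts))
    return labelings

count_3 = len(interval_dichotomies([0.3, 0.1, 0.7]))
count_5 = len(interval_dichotomies(range(5)))
```

For $m$ distinct points the count does not depend on their positions, so it gives the growth function of this class: 7 for $m = 3$ and 16 for $m = 5$, that is $m(m+1)/2 + 1$.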


Theorem 3.3 Massart’s lemma 
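Let $A \subseteq \mathbb{R}^m$ be a finite set, with $r = \max_{\mathbf{x} \in A} \|\mathbf{x}\|_2$. Then the following holds:

$$\mathop{\mathbb{E}}_{\sigma}\left[\frac{1}{m}\sup_{\mathbf{x} \in A}\sum_{i=1}^m \sigma_i x_i\right] \leq \frac{r\sqrt{2\log|A|}}{m}, \tag{3.20}$$

where $\sigma_i$s are independent uniform random variables taking values in $\{-1, +1\}$ and
$x_1, \ldots, x_m$ are the components of vector $\mathbf{x}$.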
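
The inequality (3.20) can be checked numerically on a small set by computing the expectation exactly. The set `A` and the function name below are arbitrary choices for this sketch.

```python
import math
from itertools import product

def massart_sides(A):
    """Return (lhs, rhs) of Massart's lemma (3.20) for a finite list A of m-vectors."""
    m = len(A[0])
    # Exact expectation over all 2^m Rademacher sign vectors.
    lhs = sum(
        max(sum(s * x for s, x in zip(sigma, vec)) for vec in A)
        for sigma in product((-1, 1), repeat=m)
    ) / (2 ** m * m)
    r = max(math.sqrt(sum(x * x for x in vec)) for vec in A)
    rhs = r * math.sqrt(2 * math.log(len(A))) / m
    return lhs, rhs

A = [(1, 1, 1), (1, -1, 1), (-1, -1, 1), (0.5, 0, -2)]
lhs, rhs = massart_sides(A)
```

The left-hand side is nonnegative, since the expected supremum is at least the supremum of the expectations, which is zero.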


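Proof For any $t > 0$, using Jensen’s inequality, rearranging terms, and bounding
the supremum by a sum, we obtain:

\begin{align*}
\exp\Big(t \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{\mathbf{x} \in A}\sum_{i=1}^m \sigma_i x_i\Big]\Big) &\leq \mathop{\mathbb{E}}_{\sigma}\Big(\exp\Big[t \sup_{\mathbf{x} \in A}\sum_{i=1}^m \sigma_i x_i\Big]\Big) \\
&= \mathop{\mathbb{E}}_{\sigma}\Big(\sup_{\mathbf{x} \in A}\exp\Big[t \sum_{i=1}^m \sigma_i x_i\Big]\Big) \leq \sum_{\mathbf{x} \in A}\mathop{\mathbb{E}}_{\sigma}\Big(\exp\Big[t \sum_{i=1}^m \sigma_i x_i\Big]\Big).
\end{align*}

We next use the independence of the $\sigma_i$s, then apply Hoeffding’s lemma (lemma D.1),
and use the definition of $r$ to write:

\begin{align*}
\exp\Big(t \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{\mathbf{x} \in A}\sum_{i=1}^m \sigma_i x_i\Big]\Big) &\leq \sum_{\mathbf{x} \in A}\prod_{i=1}^m \mathop{\mathbb{E}}_{\sigma_i}\big(\exp[t \sigma_i x_i]\big) \\
&\leq \sum_{\mathbf{x} \in A}\prod_{i=1}^m \exp\Big[\frac{t^2 (2x_i)^2}{8}\Big] \\
&= \sum_{\mathbf{x} \in A}\exp\Big[\frac{t^2}{2}\sum_{i=1}^m x_i^2\Big] \leq \sum_{\mathbf{x} \in A}\exp\Big[\frac{t^2 r^2}{2}\Big] = |A| e^{\frac{t^2 r^2}{2}}.
\end{align*}

Taking the log of both sides and dividing by $t$ gives us:

$$\mathop{\mathbb{E}}_{\sigma}\Big[\sup_{\mathbf{x} \in A}\sum_{i=1}^m \sigma_i x_i\Big] \leq \frac{\log|A|}{t} + \frac{t r^2}{2}. \tag{3.21}$$

If we choose $t = \frac{\sqrt{2\log|A|}}{r}$, which minimizes this upper bound, we get:

$$\mathop{\mathbb{E}}_{\sigma}\Big[\sup_{\mathbf{x} \in A}\sum_{i=1}^m \sigma_i x_i\Big] \leq r\sqrt{2\log|A|}. \tag{3.22}$$

Dividing both sides by $m$ leads to the statement of the lemma. ∎

Using this result, we can now bound the Rademacher complexity in terms of the
growth function.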


Corollary 3.1 
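Let $G$ be a family of functions taking values in $\{-1, +1\}$. Then the following holds:

$$\mathfrak{R}_m(G) \leq \sqrt{\frac{2\log\Pi_G(m)}{m}}. \tag{3.23}$$

Proof For a fixed sample $S = (x_1, \ldots, x_m)$, we denote by $G_{|S}$ the set of vectors
of function values $(g(x_1), \ldots, g(x_m))^\top$ where $g$ is in $G$. Since $g \in G$ takes values
in $\{-1, +1\}$, the norm of these vectors is bounded by $\sqrt{m}$. We can then apply
Massart’s lemma as follows:

$$\mathfrak{R}_m(G) = \mathop{\mathbb{E}}_S\Big[\mathop{\mathbb{E}}_{\sigma}\Big[\sup_{u \in G_{|S}} \frac{1}{m}\sum_{i=1}^m \sigma_i u_i\Big]\Big] \leq \mathop{\mathbb{E}}_S\Big[\frac{\sqrt{m}\sqrt{2\log|G_{|S}|}}{m}\Big].$$

By definition, $|G_{|S}|$ is bounded by the growth function, thus,

$$\mathfrak{R}_m(G) \leq \mathop{\mathbb{E}}_S\Big[\frac{\sqrt{m}\sqrt{2\log\Pi_G(m)}}{m}\Big] = \sqrt{\frac{2\log\Pi_G(m)}{m}},$$

which concludes the proof. ∎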


Combining the generalization bound (3.17) of theorem 3.2 with corollary 3.1 yields 
immediately the following generalization bound in terms of the growth function. 


Corollary 3.2 Growth function generalization bound 
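Let $H$ be a family of functions taking values in $\{-1, +1\}$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, for any $h \in H$,

$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2\log\Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{3.24}$$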
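
As a rough numerical illustration of (3.24), the sketch below evaluates the right-hand side for the class of intervals on the real line, whose growth function is $m(m+1)/2 + 1$ (a standard count, stated here as an assumption rather than taken from the text). The function names and input values are hypothetical.

```python
import math

def growth_bound(emp_error, growth, m, delta):
    """Right-hand side of (3.24), where growth is the value of Pi_H(m)."""
    return (emp_error
            + math.sqrt(2 * math.log(growth) / m)
            + math.sqrt(math.log(1 / delta) / (2 * m)))

def interval_growth(m):
    # Assumed growth function of intervals on the real line: choose the first
    # and last covered point, plus the labeling that covers no point.
    return m * (m + 1) // 2 + 1

b_small = growth_bound(0.05, interval_growth(100), 100, 0.05)
b_large = growth_bound(0.05, interval_growth(10000), 10000, 0.05)
```

Because $\Pi_H(m)$ grows only polynomially for this class, the complexity term decays like $\sqrt{\log(m)/m}$ and the bound tightens as $m$ grows.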


Growth function bounds can also be derived directly (without using Rademacher
complexity bounds first). The resulting bound is then the following:

$$\Pr\Big[\big|R(h) - \widehat{R}(h)\big| > \epsilon\Big] \leq 4\Pi_H(2m)\exp\Big(-\frac{m\epsilon^2}{8}\Big), \tag{3.25}$$

which only differs from (3.24) by constants.

The computation of the growth function may not always be convenient since, by
definition, it requires computing $\Pi_H(m)$ for all $m \geq 1$. The next section introduces
an alternative measure of the complexity of a hypothesis set $H$ that is based instead
on a single scalar, which will turn out to be in fact deeply related to the behavior
of the growth function.

Figure 3.1 VC-dimension of intervals on the real line. (a) Any two points can be
shattered. (b) No sample of three points can be shattered as the $(+, -, +)$ labeling
cannot be realized.


3.3 VC-dimension


Here, we introduce the notion of VC-dimension (Vapnik-Chervonenkis dimension). 
The VC-dimension is also a purely combinatorial notion but it is often easier to 
compute than the growth function (or the Rademacher complexity). As we shall
see, the VC-dimension is a key quantity in learning and is directly related to the 
growth function. 

To define the VC-dimension of a hypothesis set H, we first introduce the concepts 
of dichotomy and that of shattering. Given a hypothesis set H, a dichotomy of a 
set S is one of the possible ways of labeling the points of S using a hypothesis in 
H. A set S of $m \geq 1$ points is said to be shattered by a hypothesis set H when H
realizes all possible dichotomies of S, that is when $\Pi_H(m) = 2^m$.


Definition 3.4 VC-dimension 
The VC-dimension of a hypothesis set H is the size of the largest set that can be 
fully shattered by H: 


$$\mathrm{VCdim}(H) = \max\{m : \Pi_H(m) = 2^m\}. \tag{3.26}$$


Note that, by definition, if VCdim(H) = d, there exists a set of size d that can
be fully shattered. But, this does not imply that all sets of size d or less are fully
shattered; in fact, this is typically not the case.

To further illustrate this notion, we will examine a series of examples of hypothesis 
sets and will determine the VC-dimension in each case. To compute the VC- 
dimension we will typically show a lower bound for its value and then a matching 
upper bound. To give a lower bound d for VCdim(H), it suffices to show that a set
S of cardinality d can be shattered by H. To give an upper bound, we need to prove
that no set S of cardinality d + 1 can be shattered by H, which is typically more
difficult.


Example 3.1 Intervals on the real line 
Our first example involves the hypothesis class of intervals on the real line. 
It is clear that the VC-dimension is at least two, since all four dichotomies
$(+, +), (-, -), (+, -), (-, +)$ can be realized, as illustrated in figure 3.1(a). In
contrast, by the definition of intervals, no set of three points can be shattered since the
$(+, -, +)$ labeling cannot be realized. Hence, VCdim(intervals in $\mathbb{R}$) = 2.

Figure 3.2 Unrealizable dichotomies for four points using hyperplanes in $\mathbb{R}^2$. (a)
All four points lie on the convex hull. (b) Three points lie on the convex hull while
the remaining point is interior.


Example 3.2 Hyperplanes 

Consider the set of hyperplanes in $\mathbb{R}^2$. We first observe that any three non-collinear
points in $\mathbb{R}^2$ can be shattered. To obtain the first three dichotomies, we choose a
hyperplane that has two points on one side and the third point on the opposite 
side. To obtain the fourth dichotomy we have all three points on the same side of 
the hyperplane. The remaining four dichotomies are realized by simply switching 
signs. Next, we show that four points cannot be shattered by considering two cases: 
(i) the four points lie on the convex hull defined by the four points, and (ii) three 
of the four points lie on the convex hull and the remaining point is internal. In 
the first case, a positive labeling for one diagonal pair and a negative labeling for 
the other diagonal pair cannot be realized, as illustrated in figure 3.2(a). In the 
second case, a labeling which is positive for the points on the convex hull and 
negative for the interior point cannot be realized, as illustrated in figure 3.2(b). 
Hence, VCdim(hyperplanes in $\mathbb{R}^2$) = 3.

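More generally in $\mathbb{R}^d$, we derive a lower bound by starting with a set of $d + 1$
points in $\mathbb{R}^d$, setting $\mathbf{x}_0$ to be the origin and defining $\mathbf{x}_i$, for $i \in \{1, \ldots, d\}$, as the
point whose $i$th coordinate is 1 and all others are 0. Let $y_0, y_1, \ldots, y_d \in \{-1, +1\}$ be
an arbitrary set of labels for $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_d$. Let $\mathbf{w}$ be the vector whose $i$th coordinate
is $y_i$. Then the classifier defined by the hyperplane of equation $\mathbf{w} \cdot \mathbf{x} + \frac{y_0}{2} = 0$ shatters
$\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_d$ since for any $i \in [0, d]$,

$$\mathrm{sgn}\Big(\mathbf{w} \cdot \mathbf{x}_i + \frac{y_0}{2}\Big) = \mathrm{sgn}\Big(y_i + \frac{y_0}{2}\Big) = y_i. \tag{3.27}$$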
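
The construction behind (3.27) can be verified by brute force for small $d$: for every labeling of the origin and the $d$ standard basis vectors, the hyperplane with $\mathbf{w} = (y_1, \ldots, y_d)$ and offset $y_0/2$ should reproduce the labels. The function name is hypothetical.

```python
from itertools import product

def hyperplane_shatters_basis(d):
    """Check that w = (y_1, ..., y_d), b = y_0 / 2 realizes every labeling
    of the d + 1 points {0, e_1, ..., e_d} in R^d."""
    points = [[0] * d] + [[int(j == i) for j in range(d)] for i in range(d)]
    for labels in product((-1, 1), repeat=d + 1):
        w, b = labels[1:], labels[0] / 2
        for x, y in zip(points, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if (1 if score > 0 else -1) != y:
                return False
    return True

results = [hyperplane_shatters_basis(d) for d in range(1, 6)]
```

The score is $y_0/2$ at the origin and $y_i + y_0/2$ at the $i$th basis vector, so it is never zero and its sign always matches the requested label.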


To obtain an upper bound, it suffices to show that no set of $d + 2$ points can be
shattered by halfspaces. To prove this, we will use the following general theorem.

Figure 3.3 VC-dimension of axis-aligned rectangles. (a) Examples of realizable
dichotomies for four points in a diamond pattern. (b) No sample of five points can
be realized if the interior point and the remaining points have opposite labels.


Theorem 3.4 Radon’s theorem 
Any set $X$ of $d + 2$ points in $\mathbb{R}^d$ can be partitioned into two subsets $X_1$ and $X_2$ such
that the convex hulls of $X_1$ and $X_2$ intersect.


Proof Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_{d+2}\} \subset \mathbb{R}^d$. The following is a system of $d + 1$ linear
equations in $\alpha_1, \ldots, \alpha_{d+2}$:

$$\sum_{i=1}^{d+2} \alpha_i \mathbf{x}_i = 0 \quad \text{and} \quad \sum_{i=1}^{d+2} \alpha_i = 0, \tag{3.28}$$

since the first equality leads to $d$ equations, one for each component. The number
of unknowns, $d + 2$, is larger than the number of equations, $d + 1$, therefore
the system admits a non-zero solution $\beta_1, \ldots, \beta_{d+2}$. Since $\sum_{i=1}^{d+2} \beta_i = 0$, both
$I_1 = \{i \in [1, d + 2] : \beta_i > 0\}$ and $I_2 = \{i \in [1, d + 2] : \beta_i \leq 0\}$ are non-empty
sets and $X_1 = \{\mathbf{x}_i : i \in I_1\}$ and $X_2 = \{\mathbf{x}_i : i \in I_2\}$ form a partition of $X$. By the
last equation of (3.28), $\sum_{i \in I_1} \beta_i = -\sum_{i \in I_2} \beta_i$. Let $\beta = \sum_{i \in I_1} \beta_i$. Then, the first
part of (3.28) implies

$$\sum_{i \in I_1} \frac{\beta_i}{\beta}\mathbf{x}_i = \sum_{i \in I_2} \frac{-\beta_i}{\beta}\mathbf{x}_i,$$

with $\sum_{i \in I_1} \frac{\beta_i}{\beta} = \sum_{i \in I_2} \frac{-\beta_i}{\beta} = 1$, and $\frac{\beta_i}{\beta} \geq 0$ for $i \in I_1$ and $\frac{-\beta_i}{\beta} \geq 0$ for $i \in I_2$. By
definition of the convex hulls (B.4), this implies that $\sum_{i \in I_1} \frac{\beta_i}{\beta}\mathbf{x}_i$ belongs both to
the convex hull of $X_1$ and to that of $X_2$. ∎

Figure 3.4 Convex d-gons in the plane can shatter $2d + 1$ points. (a) d-gon
construction when there are more negative labels. (b) d-gon construction when
there are more positive labels.


Now, let $X$ be a set of $d + 2$ points. By Radon’s theorem, it can be partitioned
into two sets $X_1$ and $X_2$ such that their convex hulls intersect. Observe that when
two sets of points $X_1$ and $X_2$ are separated by a hyperplane, their convex hulls
are also separated by that hyperplane. Thus, $X_1$ and $X_2$ cannot be separated by
a hyperplane and $X$ is not shattered. Combining our lower and upper bounds, we
have proven that VCdim(hyperplanes in $\mathbb{R}^d$) = $d + 1$.
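
The proof of Radon’s theorem is constructive, and its steps can be followed in code: solve the homogeneous system (3.28) for a non-zero solution, split the indices by sign, and form the two convex combinations. The sketch below uses exact rational arithmetic and a plain Gauss-Jordan elimination; the function name and the example points are hypothetical.

```python
from fractions import Fraction

def radon_partition(points):
    """Return (I1, I2, p1, p2) for d + 2 points in R^d: the two index sets and
    the convex combinations of each side, which coincide."""
    n, d = len(points), len(points[0])  # assumes n == d + 2
    # One equation per coordinate, plus the sum-to-zero constraint of (3.28).
    rows = [[Fraction(p[k]) for p in points] for k in range(d)]
    rows.append([Fraction(1)] * n)
    pivots, r = [], 0
    for c in range(n):  # Gauss-Jordan elimination
        if r == len(rows):
            break
        pr = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pr is None:
            continue
        rows[r], rows[pr] = rows[pr], rows[r]
        rows[r] = [v / rows[r][c] for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    # Set one free variable to 1 to obtain a non-zero solution beta.
    free = next(c for c in range(n) if c not in pivots)
    beta = [Fraction(0)] * n
    beta[free] = Fraction(1)
    for row, c in zip(rows, pivots):
        beta[c] = -row[free]
    I1 = [i for i in range(n) if beta[i] > 0]
    I2 = [i for i in range(n) if beta[i] <= 0]
    total = sum(beta[i] for i in I1)
    p1 = tuple(sum(beta[i] / total * points[i][k] for i in I1) for k in range(d))
    p2 = tuple(sum(-beta[i] / total * points[i][k] for i in I2) for k in range(d))
    return I1, I2, p1, p2

I1, I2, p1, p2 = radon_partition([(0, 0), (2, 0), (0, 2), (1, 1)])
```

On this example the point $(1, 1)$ forms one side of the partition and lies in the convex hull of the other three points, so no hyperplane separates the two sides.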


Example 3.3 Axis-aligned Rectangles

We first show that the VC-dimension is at least four, by considering four points
in a diamond pattern. Then, it is clear that all 16 dichotomies can be realized,
some of which are illustrated in figure 3.3(a). In contrast, for any set of five distinct
points, if we construct the minimal axis-aligned rectangle containing these points,
one of the five points is in the interior of this rectangle. Imagine that we assign a
negative label to this interior point and a positive label to each of the remaining
four points, as illustrated in figure 3.3(b). There is no axis-aligned rectangle that
can realize this labeling. Hence, no set of five distinct points can be shattered and
VCdim(axis-aligned rectangles) = 4.
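
The argument of this example can be replayed by brute force. A labeling is realizable by an axis-aligned rectangle exactly when the tightest rectangle around the positive points contains no negative point, which is the test used below. The function name and the point sets are hypothetical.

```python
from itertools import product

def rectangles_shatter(points):
    """True if axis-aligned rectangles realize every labeling of `points`."""
    for labels in product((0, 1), repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l]
        if not pos:
            continue  # a rectangle away from all points gives the all-negative labeling
        x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
        y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
        # The tightest rectangle around the positives must exclude every negative.
        for (x, y), l in zip(points, labels):
            if not l and x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                return False
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
four_ok = rectangles_shatter(diamond)
five_ok = rectangles_shatter(diamond + [(0, 0)])
```

The four diamond points are shattered, while adding the center point fails on the labeling that marks the center negative and the other four positive.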


Example 3.4 Convex Polygons 

We focus on the class of convex d-gons in the plane. To get a lower bound, we
show that a set of $2d + 1$ points can be fully shattered. To do this, we select
$2d + 1$ points that lie on a circle, and for a particular labeling, if there are more
negative than positive labels, then the points with the positive labels are used as
the polygon’s vertices, as in figure 3.4(a). Otherwise, the tangents of the negative
points serve as the edges of the polygon, as shown in figure 3.4(b). To derive an upper
bound, it can be shown that choosing points on the circle maximizes the number
of possible dichotomies, and thus VCdim(convex d-gons) = $2d + 1$. Note also that
VCdim(convex polygons) = $+\infty$.

Figure 3.5 An example of a sine function (with $\omega = 50$) used for classification.


Example 3.5 Sine Functions 

The previous examples could suggest that the VC-dimension of H coincides with 
the number of free parameters defining H. For example, the number of parameters 
defining hyperplanes matches their VC-dimension. However, this does not hold in 
general. Several of the exercises in this chapter illustrate this fact. The following 
provides a striking example from this point of view. Consider the following family 
of sine functions: $\{t \mapsto \sin(\omega t) : \omega \in \mathbb{R}\}$. One instance of this function class is shown
in figure 3.5. These sine functions can be used to classify the points on the real line:
a point is labeled positively if it is above the curve, negatively otherwise. Although
this family of sine functions is defined via a single parameter, $\omega$, it can be shown
that VCdim(sine functions) = $+\infty$ (exercise 3.12).


The VC-dimension of many other hypothesis sets can be determined or upper- 
bounded in a similar way (see this chapter’s exercises). In particular, the VC- 
dimension of any vector space of dimension $r < \infty$ can be shown to be at most
r (exercise 3.11). The next result known as Sauer’s lemma clarifies the connection 
between the notions of growth function and VC-dimension. 


Theorem 3.5 Sauer’s lemma
Let $H$ be a hypothesis set with VCdim($H$) $= d$. Then, for all $m \in \mathbb{N}$, the following
inequality holds:

$$\Pi_H(m) \le \sum_{i=0}^{d} \binom{m}{i}. \tag{3.29}$$
Figure 3.6 Illustration of how $G_1$ and $G_2$ are constructed in the proof of Sauer’s
lemma, with $G_1 = G_{|S'}$ and $G_2 = \{g' \subseteq S' : (g' \in G) \wedge (g' \cup \{x_m\} \in G)\}$.


Proof The proof is by induction on $m + d$. The statement clearly holds for $m = 1$
and $d = 0$ or $d = 1$. Now, assume that it holds for $(m - 1, d - 1)$ and $(m - 1, d)$.
Fix a set $S = \{x_1, \ldots, x_m\}$ with $\Pi_H(m)$ dichotomies and let $G = H_{|S}$ be the set of
concepts $H$ induces by restriction to $S$.

Now consider the following families over $S' = \{x_1, \ldots, x_{m-1}\}$. We define $G_1 =
G_{|S'}$ as the set of concepts $H$ induces by restriction to $S'$. Next, by identifying each
concept as the set of points (in $S'$ or $S$) for which it is non-zero, we can define $G_2$
as

$$G_2 = \{g' \subseteq S' : (g' \in G) \wedge (g' \cup \{x_m\} \in G)\}.$$

Since $g' \subseteq S'$, $g' \in G$ means that without adding $x_m$ it is a concept of $G$. Further,
the constraint $g' \cup \{x_m\} \in G$ means that adding $x_m$ to $g'$ also makes it a concept
of $G$. The construction of $G_1$ and $G_2$ is illustrated pictorially in figure 3.6. Given
our definitions of $G_1$ and $G_2$, observe that $|G_1| + |G_2| = |G|$.

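Since VCdim($G_1$) $\le$ VCdim($G$) $\le d$, then by definition of the growth function
and using the induction hypothesis,

$$|G_1| \le \Pi_{G_1}(m - 1) \le \sum_{i=0}^{d} \binom{m-1}{i}.$$

Further, by definition of $G_2$, if a set $Z \subseteq S'$ is shattered by $G_2$, then the set $Z \cup \{x_m\}$
is shattered by $G$. Hence,

$$\mathrm{VCdim}(G_2) \le \mathrm{VCdim}(G) - 1 = d - 1,$$

and by definition of the growth function and using the induction hypothesis,

$$|G_2| \le \Pi_{G_2}(m - 1) \le \sum_{i=0}^{d-1} \binom{m-1}{i}.$$

Thus,

$$|G| = |G_1| + |G_2| \le \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d-1} \binom{m-1}{i} = \sum_{i=0}^{d} \left[ \binom{m-1}{i} + \binom{m-1}{i-1} \right] = \sum_{i=0}^{d} \binom{m}{i},$$

where $\binom{m-1}{-1} = 0$ by convention and the last equality is Pascal’s rule,
which completes the inductive proof.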
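To make the lemma concrete, the sketch below enumerates the dichotomies that closed intervals induce on $m$ distinct points of the real line and compares their number with the Sauer bound $\sum_{i=0}^{d} \binom{m}{i}$ for $d = 2$, the VC-dimension of intervals stated in exercise 3.1. The function names are hypothetical and the enumeration is only meant as an illustration for small $m$.

```python
from math import comb


def sauer_bound(m, d):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))


def interval_dichotomies(m):
    """Count labelings of m distinct sorted points realizable by an interval [a, b].

    A point is labeled 1 when it lies inside the interval, so the realizable
    labelings are the all-zero one plus every contiguous run of ones.
    """
    labelings = {tuple([0] * m)}
    for start in range(m):
        for end in range(start, m):
            labelings.add(tuple(1 if start <= k <= end else 0 for k in range(m)))
    return len(labelings)
```

For this particular class the count never exceeds the bound, which is what the lemma guarantees for any hypothesis set of VC-dimension 2.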


The significance of Sauer’s lemma can be seen by corollary 3.3, which remarkably
shows that the growth function only exhibits two types of behavior: either VCdim($H$) $=
d < +\infty$, in which case $\Pi_H(m) = O(m^d)$, or VCdim($H$) $= +\infty$, in which case
$\Pi_H(m) = 2^m$.


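Corollary 3.3
Let $H$ be a hypothesis set with VCdim($H$) $= d$. Then for all $m \ge d$,

$$\Pi_H(m) \le \left(\frac{em}{d}\right)^d = O(m^d). \tag{3.30}$$

Proof The proof begins by using Sauer’s lemma. The first inequality multiplies
each summand by a factor that is greater than or equal to one since $m \ge d$, while
the second inequality adds non-negative summands to the summation.

$$
\begin{aligned}
\Pi_H(m) &\le \sum_{i=0}^{d} \binom{m}{i} \\
&\le \sum_{i=0}^{d} \binom{m}{i} \left(\frac{m}{d}\right)^{d-i} \\
&\le \sum_{i=0}^{m} \binom{m}{i} \left(\frac{m}{d}\right)^{d-i} \\
&= \left(\frac{m}{d}\right)^{d} \sum_{i=0}^{m} \binom{m}{i} \left(\frac{d}{m}\right)^{i} \\
&= \left(\frac{m}{d}\right)^{d} \left(1 + \frac{d}{m}\right)^{m} \le \left(\frac{m}{d}\right)^{d} e^{d}.
\end{aligned}
$$

After simplifying the expression using the binomial theorem, the final inequality
follows using the general inequality $(1 - x) \le e^{-x}$, applied with $x = -d/m$.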
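The chain of inequalities in this proof, $\sum_{i \le d} \binom{m}{i} \le (m/d)^d (1 + d/m)^m \le (em/d)^d$ for $m \ge d \ge 1$, can be checked numerically. The following Python sketch does so on a small grid; the function names are assumptions introduced for this illustration, and a finite check is of course no substitute for the proof.

```python
from math import comb, e


def sauer_sum(m, d):
    """Sum of binomial coefficients C(m, i) for i = 0, ..., d."""
    return sum(comb(m, i) for i in range(d + 1))


def polynomial_bound(m, d):
    """(e * m / d) ** d, an upper bound on the growth function when m >= d >= 1."""
    return (e * m / d) ** d


def chain_holds(m, d):
    """Check sum_{i<=d} C(m, i) <= (m/d)^d (1 + d/m)^m <= (e m / d)^d."""
    middle = (m / d) ** d * (1 + d / m) ** m
    return sauer_sum(m, d) <= middle <= polynomial_bound(m, d)
```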


The explicit relationship just formulated between VC-dimension and the growth
function combined with corollary 3.2 leads immediately to the following
generalization bounds based on the VC-dimension.


Corollary 3.4 VC-dimension generalization bounds
Let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC-dimension $d$.
Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all
$h \in H$:

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}. \tag{3.31}$$

Thus, the form of this generalization bound is

$$R(h) \le \widehat{R}(h) + O\left(\sqrt{\frac{\log(m/d)}{m/d}}\right), \tag{3.32}$$

which emphasizes the importance of the ratio $m/d$ for generalization. The corollary
provides another instance of Occam’s razor principle where simplicity is measured
in terms of smaller VC-dimension.

VC-dimension bounds can be derived directly without using an intermediate
Rademacher complexity bound, as for (3.25): combining Sauer’s lemma with (3.25)
leads to the following high-probability bound

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{8d \log \frac{2em}{d} + 8 \log \frac{4}{\delta}}{m}},$$

which has the general form of (3.32). The log factor plays only a minor role in these
bounds. A finer analysis can be used in fact to eliminate that factor.
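To get a feel for the magnitudes involved, the following Python sketch evaluates the two terms that corollary 3.4 adds to the empirical error, namely $\sqrt{2d \log(em/d)/m}$ and $\sqrt{\log(1/\delta)/(2m)}$. The function name and the input validation are assumptions made for this illustration.

```python
import math


def vc_bound_gap(m, d, delta):
    """Sum of the complexity and confidence terms of the VC-dimension bound.

    This is the quantity added to the empirical error, assuming m >= d >= 1
    and 0 < delta < 1.
    """
    if not (m >= d >= 1 and 0 < delta < 1):
        raise ValueError("expected m >= d >= 1 and 0 < delta < 1")
    complexity = math.sqrt(2 * d * math.log(math.e * m / d) / m)
    confidence = math.sqrt(math.log(1 / delta) / (2 * m))
    return complexity + confidence
```

For instance, with $d = 10$ and $\delta = 0.05$, increasing $m$ from 1,000 to 10,000 visibly shrinks the gap, in line with the role of the ratio $m/d$.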


3.4 Lower bounds 


In the previous section, we presented several upper bounds on the generalization 
error. In contrast, this section provides lower bounds on the generalization error of 
any learning algorithm in terms of the VC-dimension of the hypothesis set used. 
These lower bounds are shown by finding for any algorithm a ‘bad’ distribution. 
Since the learning algorithm is arbitrary, it will be difficult to specify that particular 
distribution. Instead, it suffices to prove its existence non-constructively. At a high 
level, the proof technique used to achieve this is the probabilistic method of Paul
Erdős. In the context of the following proofs, first a lower bound is given on the
expected error over the parameters defining the distributions. From that, the lower
bound is shown to hold for at least one set of parameters, that is, one distribution.
Theorem 3.6 Lower bound, realizable case

Let $H$ be a hypothesis set with VC-dimension $d > 1$. Then, for any learning
algorithm $A$, there exist a distribution $D$ over $\mathcal{X}$ and a target function $f \in H$
such that

$$\Pr_{S \sim D^m}\left[R_D(h_S, f) > \frac{d-1}{32m}\right] \ge 1/100. \tag{3.33}$$


Proof Let $X = \{x_0, x_1, \ldots, x_{d-1}\} \subseteq \mathcal{X}$ be a set that is fully shattered by $H$. For
any $\epsilon > 0$, we choose $D$ such that its support is reduced to $X$ and so that one
point ($x_0$) has very high probability ($1 - 8\epsilon$), with the rest of the probability mass
distributed uniformly among the other points:

$$\Pr_D[x_0] = 1 - 8\epsilon \quad \text{and} \quad \forall i \in [1, d-1],\ \Pr_D[x_i] = \frac{8\epsilon}{d-1}. \tag{3.34}$$

With this definition, most samples would contain $x_0$ and, since $X$ is fully shattered,
$A$ can essentially do no better than tossing a coin when determining the label of a
point $x_i$ not falling in the training set.

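We assume without loss of generality that $A$ makes no error on $x_0$. For a sample
$S$, we let $\overline{S}$ denote the set of its elements falling in $\{x_1, \ldots, x_{d-1}\}$, and let $\mathcal{S}$ be the
set of samples $S$ of size $m$ such that $|\overline{S}| \le (d-1)/2$. Now, fix a sample $S \in \mathcal{S}$, and
consider the uniform distribution $U$ over all labelings $f\colon X \to \{0, 1\}$, which are all
in $H$ since the set is shattered. Then, the following lower bound holds:

$$
\begin{aligned}
\mathop{\mathrm{E}}_{f \sim U}[R_D(h_S, f)] &= \sum_{f} \sum_{x \in X} 1_{h_S(x) \ne f(x)} \Pr[x] \Pr[f] \\
&\ge \sum_{f} \sum_{x \notin \overline{S}} 1_{h_S(x) \ne f(x)} \Pr[x] \Pr[f] \\
&= \sum_{x \notin \overline{S}} \left( \sum_{f} 1_{h_S(x) \ne f(x)} \Pr[f] \right) \Pr[x] \\
&= \frac{1}{2} \sum_{x \notin \overline{S}} \Pr[x] \ge \frac{1}{2} \cdot \frac{d-1}{2} \cdot \frac{8\epsilon}{d-1} = 2\epsilon.
\end{aligned}
\tag{3.35}
$$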


The first lower bound holds because we remove non-negative terms from the
summation when we only consider $x \notin \overline{S}$ instead of all $x$ in $X$. After rearranging
terms, the subsequent equality holds since we are taking an expectation over $f \in H$
with uniform weight on each $f$ and $H$ shatters $X$. The final lower bound holds due
to the definitions of $D$ and $\overline{S}$, the latter of which implies that $|X - \overline{S}| \ge (d-1)/2$.
Since (3.35) holds for all $S \in \mathcal{S}$, it also holds in expectation over all $S \in \mathcal{S}$:
$\mathrm{E}_{S \in \mathcal{S}}\left[\mathrm{E}_{f \sim U}[R_D(h_S, f)]\right] \ge 2\epsilon$. By Fubini’s theorem, the expectations can be
permuted, thus,

$$\mathop{\mathrm{E}}_{f \sim U}\left[\mathop{\mathrm{E}}_{S \in \mathcal{S}}[R_D(h_S, f)]\right] \ge 2\epsilon. \tag{3.36}$$


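This implies that $\mathrm{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \ge 2\epsilon$ for at least one labeling $f_0 \in H$.
Decomposing this expectation into two parts and using $R_D(h_S, f_0) \le \Pr_D[X - \{x_0\}]$, we
obtain:

$$
\begin{aligned}
\mathop{\mathrm{E}}_{S \in \mathcal{S}}[R_D(h_S, f_0)] &= \sum_{S \colon R_D(h_S, f_0) \ge \epsilon} R_D(h_S, f_0) \Pr[R_D(h_S, f_0)] + \sum_{S \colon R_D(h_S, f_0) < \epsilon} R_D(h_S, f_0) \Pr[R_D(h_S, f_0)] \\
&\le \Pr_D[X - \{x_0\}] \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] + \epsilon \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) < \epsilon] \\
&\le 8\epsilon \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] + \epsilon \left(1 - \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon]\right).
\end{aligned}
$$

Collecting terms in $\Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon]$ yields

$$\Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] \ge \frac{1}{7\epsilon}(2\epsilon - \epsilon) = \frac{1}{7}. \tag{3.37}$$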


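Thus, the probability over all samples $S$ (not necessarily in $\mathcal{S}$) can be lower bounded
as

$$\Pr_{S}[R_D(h_S, f_0) \ge \epsilon] \ge \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \ge \epsilon] \Pr[\mathcal{S}] \ge \frac{1}{7} \Pr[\mathcal{S}]. \tag{3.38}$$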


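This leads us to find a lower bound for $\Pr[\mathcal{S}]$. The probability that more than
$(d-1)/2$ points are drawn in a sample of size $m$ verifies the Chernoff bound for any
$\gamma > 0$:

$$1 - \Pr[\mathcal{S}] = \Pr[S_m \ge 8\epsilon m(1 + \gamma)] \le e^{-8\epsilon m \frac{\gamma^2}{3}}. \tag{3.39}$$

Therefore, for $\epsilon = (d-1)/(32m)$ and $\gamma = 1$,

$$\Pr\left[S_m \ge \frac{d-1}{2}\right] \le e^{-(d-1)/12} \le e^{-1/12} \le 1 - 7\delta, \tag{3.40}$$

for $\delta \le .01$. Thus $\Pr[\mathcal{S}] \ge 7\delta$ and $\Pr_S[R_D(h_S, f_0) \ge \epsilon] \ge \delta$.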
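The Chernoff step can be compared with an exact computation. In the sketch below, $S_m$ is read as the number of sample points falling outside $x_0$, so that $S_m$ follows a binomial distribution with parameters $m$ and $8\epsilon = (d-1)/(4m)$; this reading of $S_m$ and the function names are assumptions made for the illustration.

```python
from math import comb, exp


def prob_many_rare_points(m, d):
    """Exact Pr[S_m >= (d-1)/2] for S_m ~ Binomial(m, 8*eps), eps = (d-1)/(32m)."""
    p = (d - 1) / (4 * m)  # 8 * eps: total mass of the points x_1, ..., x_{d-1}
    threshold = (d - 1) / 2
    return sum(comb(m, k) * p ** k * (1 - p) ** (m - k)
               for k in range(m + 1) if k >= threshold)


def chernoff_bound(d):
    """Chernoff upper bound exp(-(d-1)/12) obtained with gamma = 1."""
    return exp(-(d - 1) / 12)
```

On the values tried here the exact tail stays below the Chernoff bound, which itself stays below $1 - 7\delta$ for $\delta = .01$.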


The theorem shows that for any algorithm $A$, there exists a ‘bad’ distribution over
$\mathcal{X}$ and a target function $f$ for which the error of the hypothesis returned by $A$ is
$\Omega\left(\frac{d}{m}\right)$ with some constant probability. This further demonstrates the key role played
by the VC-dimension in learning. The result implies in particular that PAC-learning
in the realizable case is not possible when the VC-dimension is infinite.

Note that the proof shows a stronger result than the statement of the theorem: 
the distribution D is selected independently of the algorithm A. We now present a 
theorem giving a lower bound in the non-realizable case. The following two lemmas 
will be needed for the proof. 
Lemma 3.2
Let $\alpha$ be a uniformly distributed random variable taking values in $\{\alpha_-, \alpha_+\}$, where
$\alpha_- = \frac{1}{2} - \frac{\epsilon}{2}$ and $\alpha_+ = \frac{1}{2} + \frac{\epsilon}{2}$, and let $S$ be a sample of $m \ge 1$ random variables
$X_1, \ldots, X_m$ taking values in $\{0, 1\}$ and drawn i.i.d. according to the distribution $D_\alpha$
defined by $\Pr_{D_\alpha}[X = 1] = \alpha$. Let $h$ be a function from $X^m$ to $\{\alpha_-, \alpha_+\}$, then the
following holds:

$$\mathop{\mathrm{E}}_{\alpha}\left[\Pr_{S \sim D_\alpha^m}[h(S) \ne \alpha]\right] \ge \Phi(2\lceil m/2 \rceil, \epsilon), \tag{3.41}$$

where $\Phi(m, \epsilon) = \frac{1}{4}\left(1 - \sqrt{1 - \exp\left(-\frac{m\epsilon^2}{1 - \epsilon^2}\right)}\right)$ for all $m$ and $\epsilon$.

Proof The lemma can be interpreted in terms of an experiment with two coins
with biases $\alpha_-$ and $\alpha_+$. It implies that for a discriminant rule $h(S)$ based on a
sample $S$ drawn from $D_{\alpha_-}$ or $D_{\alpha_+}$, to determine which coin was tossed, the sample
size $m$ must be at least $\Omega(1/\epsilon^2)$. The proof is left as an exercise (exercise 3.19).

We will make use of the fact that for any fixed $\epsilon$ the function $m \mapsto \Phi(m, \epsilon)$ is
convex, which is not hard to establish.
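The function $\Phi(m, \epsilon) = \frac{1}{4}\left(1 - \sqrt{1 - \exp(-m\epsilon^2/(1 - \epsilon^2))}\right)$ is easy to evaluate, and its convexity in $m$ can be sanity-checked on an integer grid. The following Python sketch does this; the function names, the grid, and the numerical tolerance are assumptions for illustration and do not replace an analytic argument.

```python
import math


def phi(m, eps):
    """Phi(m, eps) = (1/4) * (1 - sqrt(1 - exp(-m * eps^2 / (1 - eps^2))))."""
    return 0.25 * (1 - math.sqrt(1 - math.exp(-m * eps ** 2 / (1 - eps ** 2))))


def is_discretely_convex(eps, max_m=200):
    """Check phi(m-1) + phi(m+1) >= 2 * phi(m) for m = 1, ..., max_m - 1."""
    return all(phi(m - 1, eps) + phi(m + 1, eps) - 2 * phi(m, eps) >= -1e-12
               for m in range(1, max_m))
```

Note that `phi(0, eps)` equals 1/4 and that the value decays as `m` grows, which matches the intuition that more tosses make the two coins easier to tell apart.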


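Lemma 3.3
Let $Z$ be a random variable taking values in $[0, 1]$. Then, for any $\gamma \in [0, 1)$,

$$\Pr[Z > \gamma] \ge \frac{\mathrm{E}[Z] - \gamma}{1 - \gamma} > \mathrm{E}[Z] - \gamma. \tag{3.42}$$

Proof Since the values taken by $Z$ are in $[0, 1]$,

$$
\begin{aligned}
\mathrm{E}[Z] &= \sum_{z \le \gamma} \Pr[Z = z]\, z + \sum_{z > \gamma} \Pr[Z = z]\, z \\
&\le \sum_{z \le \gamma} \Pr[Z = z]\, \gamma + \sum_{z > \gamma} \Pr[Z = z] \\
&= \gamma \Pr[Z \le \gamma] + \Pr[Z > \gamma] \\
&= \gamma (1 - \Pr[Z > \gamma]) + \Pr[Z > \gamma] \\
&= (1 - \gamma) \Pr[Z > \gamma] + \gamma,
\end{aligned}
$$

which concludes the proof.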
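As a small sanity check of the first inequality of the lemma, $\Pr[Z > \gamma] \ge (\mathrm{E}[Z] - \gamma)/(1 - \gamma)$, the sketch below evaluates both sides for a discrete random variable given by its values and probabilities. The function name and the tolerance used for the equality case are assumptions of this illustration.

```python
def lemma_3_3_holds(values, probs, gamma):
    """Check Pr[Z > gamma] >= (E[Z] - gamma) / (1 - gamma) for a discrete Z in [0, 1]."""
    mean = sum(z * p for z, p in zip(values, probs))
    tail = sum(p for z, p in zip(values, probs) if z > gamma)
    return tail >= (mean - gamma) / (1 - gamma) - 1e-12
```

The inequality is tight when $Z$ only takes the values $\gamma$ and 1, which is why a small tolerance is used in the comparison.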


Theorem 3.7 Lower bound, non-realizable case
Let $H$ be a hypothesis set with VC-dimension $d > 1$. Then, for any learning
algorithm $A$, there exists a distribution $D$ over $\mathcal{X} \times \{0, 1\}$ such that:

$$\Pr_{S \sim D^m}\left[R_D(h_S) - \inf_{h \in H} R_D(h) > \sqrt{\frac{d}{320m}}\right] \ge 1/64. \tag{3.43}$$

Equivalently, for any learning algorithm, the sample complexity verifies

$$m \ge \frac{d}{320\epsilon^2}. \tag{3.44}$$

Proof Let $X = \{x_1, \ldots, x_d\} \subseteq \mathcal{X}$ be a set fully shattered by $H$. For any
$\alpha \in [0, 1]$ and any vector $\sigma = (\sigma_1, \ldots, \sigma_d)^\top \in \{-1, +1\}^d$, we define a distribution
$D_\sigma$ with support $X \times \{0, 1\}$ as follows:

$$\forall i \in [1, d],\quad \Pr_{D_\sigma}[(x_i, 1)] = \frac{1}{d}\left(\frac{1}{2} + \frac{\sigma_i \alpha}{2}\right). \tag{3.45}$$

Thus, the label of each point $x_i$, $i \in [1, d]$, follows the distribution $\Pr_{D_\sigma}[\cdot | x_i]$, that
of a biased coin where the bias is determined by the sign of $\sigma_i$ and the magnitude
of $\alpha$. To determine the most likely label of each point $x_i$, the learning algorithm
will therefore need to estimate $\Pr_{D_\sigma}[1 | x_i]$ with an accuracy better than $\alpha$. To make
this further difficult, $\alpha$ and $\sigma$ will be selected based on the algorithm, requiring, as
in lemma 3.2, $\Omega(1/\alpha^2)$ instances of each point $x_i$ in the training sample.

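Clearly, the Bayes classifier $h^*_{D_\sigma}$ is defined by $h^*_{D_\sigma}(x_i) = \mathrm{argmax}_{y \in \{0, 1\}} \Pr[y | x_i] =
1_{\sigma_i > 0}$ for all $i \in [1, d]$. $h^*_{D_\sigma}$ is in $H$ since $X$ is fully shattered. For all $h \in H$,

$$R_{D_\sigma}(h) - R_{D_\sigma}(h^*_{D_\sigma}) = \frac{1}{d} \sum_{x \in X} \left(\frac{\alpha}{2} + \frac{\alpha}{2}\right) 1_{h(x) \ne h^*_{D_\sigma}(x)} = \frac{\alpha}{d} \sum_{x \in X} 1_{h(x) \ne h^*_{D_\sigma}(x)}. \tag{3.46}$$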


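Let $h_S$ denote the hypothesis returned by the learning algorithm $A$ after receiving
a labeled sample $S$ drawn according to $D_\sigma$. We will denote by $|S|_x$ the number of
occurrences of a point $x$ in $S$. Let $U$ denote the uniform distribution over $\{-1, +1\}^d$.
Then, in view of (3.46), the following holds:

$$
\begin{aligned}
\mathop{\mathrm{E}}_{\substack{\sigma \sim U \\ S \sim D_\sigma^m}}&\left[\frac{1}{\alpha}\left[R_{D_\sigma}(h_S) - R_{D_\sigma}(h^*_{D_\sigma})\right]\right] \\
&= \frac{1}{d} \sum_{x \in X} \mathop{\mathrm{E}}_{\substack{\sigma \sim U \\ S \sim D_\sigma^m}}\left[1_{h_S(x) \ne h^*_{D_\sigma}(x)}\right] \\
&= \frac{1}{d} \sum_{x \in X} \mathop{\mathrm{E}}_{\sigma \sim U}\left[\Pr_{S \sim D_\sigma^m}\left[h_S(x) \ne h^*_{D_\sigma}(x)\right]\right] \\
&= \frac{1}{d} \sum_{x \in X} \sum_{n=0}^{m} \mathop{\mathrm{E}}_{\sigma \sim U}\left[\Pr_{S \sim D_\sigma^m}\left[h_S(x) \ne h^*_{D_\sigma}(x) \,\middle|\, |S|_x = n\right] \Pr[|S|_x = n]\right] \\
&\ge \frac{1}{d} \sum_{x \in X} \sum_{n=0}^{m} \Phi(n + 1, \alpha) \Pr[|S|_x = n] && \text{(lemma 3.2)} \\
&\ge \frac{1}{d} \sum_{x \in X} \Phi(m/d + 1, \alpha) && \text{(convexity of $\Phi(\cdot, \alpha)$ and Jensen’s ineq.)} \\
&= \Phi(m/d + 1, \alpha).
\end{aligned}
$$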


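Since the expectation over $\sigma$ is lower-bounded by $\Phi(m/d + 1, \alpha)$, there must exist
some $\sigma \in \{-1, +1\}^d$ for which

$$\mathop{\mathrm{E}}_{S \sim D_\sigma^m}\left[\frac{1}{\alpha}\left[R_{D_\sigma}(h_S) - R_{D_\sigma}(h^*_{D_\sigma})\right]\right] > \Phi(m/d + 1, \alpha). \tag{3.47}$$

Then, by lemma 3.3, for that $\sigma$, for any $\gamma \in [0, 1)$,

$$\Pr_{S \sim D_\sigma^m}\left[\frac{1}{\alpha}\left[R_{D_\sigma}(h_S) - R_{D_\sigma}(h^*_{D_\sigma})\right] > \gamma u\right] > (1 - \gamma)u, \tag{3.48}$$

where $u = \Phi(m/d + 1, \alpha)$. Selecting $\delta$ and $\epsilon$ such that $\delta \le (1 - \gamma)u$ and $\epsilon \le \gamma \alpha u$
gives

$$\Pr_{S \sim D_\sigma^m}\left[R_{D_\sigma}(h_S) - R_{D_\sigma}(h^*_{D_\sigma}) > \epsilon\right] > \delta. \tag{3.49}$$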
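To satisfy the inequalities defining $\epsilon$ and $\delta$, let $\gamma = 1 - 8\delta$. Then,

$$
\begin{aligned}
\delta \le (1 - \gamma)u &\iff u \ge \frac{1}{8} && (3.50) \\
&\iff \frac{1}{4}\left(1 - \sqrt{1 - \exp\left(-\frac{(m/d + 1)\alpha^2}{1 - \alpha^2}\right)}\right) \ge \frac{1}{8} && (3.51) \\
&\iff \frac{(m/d + 1)\alpha^2}{1 - \alpha^2} \le \log\frac{4}{3} && (3.52) \\
&\iff \frac{m}{d} \le \left(\frac{1}{\alpha^2} - 1\right) \log\frac{4}{3} - 1. && (3.53)
\end{aligned}
$$

Selecting $\alpha = 8\epsilon/(1 - 8\delta)$ gives $\epsilon = \gamma\alpha/8$ and the condition

$$\frac{m}{d} \le \left(\frac{(1 - 8\delta)^2}{64\epsilon^2} - 1\right) \log\frac{4}{3} - 1. \tag{3.54}$$

Let $f(1/\epsilon^2)$ denote the right-hand side. We are seeking a sufficient condition of the
form $m/d \le \omega/\epsilon^2$. Since $\epsilon \le 1/64$, to ensure that $\omega/\epsilon^2 \le f(1/\epsilon^2)$, it suffices to
impose $\omega/(1/64)^2 = f(1/(1/64)^2)$. This condition gives

$$\omega = (7/64)^2 \log(4/3) - (1/64)^2 (\log(4/3) + 1) \approx .003127 \ge 1/320 = .003125.$$

Thus, $\epsilon^2 \le \frac{1}{320(m/d)}$ is sufficient to ensure the inequalities.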
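The constant $\omega$ at the end of the proof can be recomputed directly. The sketch below evaluates the right-hand side $f(1/\epsilon^2) = \left(\frac{(1 - 8\delta)^2}{64\epsilon^2} - 1\right)\log\frac{4}{3} - 1$ at $\epsilon = 1/64$ and compares it with the closed form $(7/64)^2 \log(4/3) - (1/64)^2(\log(4/3) + 1)$. Taking $\delta = 1/64$, the probability level in the statement of theorem 3.7, is an assumption of this sketch, as are the variable names.

```python
import math


def f_rhs(inv_eps_sq, delta):
    """Right-hand side of the sample-size condition as a function of 1/eps^2."""
    return ((1 - 8 * delta) ** 2 * inv_eps_sq / 64 - 1) * math.log(4 / 3) - 1


# omega is obtained by imposing omega / eps^2 = f(1 / eps^2) at eps = 1/64,
# with delta = 1/64 (assumed here to match the constant 1/64 of the theorem).
omega = f_rhs(64 ** 2, 1 / 64) / 64 ** 2
closed_form = (7 / 64) ** 2 * math.log(4 / 3) - (1 / 64) ** 2 * (math.log(4 / 3) + 1)
```

Both expressions agree up to rounding and exceed $1/320$ by a small margin, which is what the final step of the proof relies on.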


The theorem shows that for any algorithm $A$, in the non-realizable case, there exists
a ‘bad’ distribution over $\mathcal{X} \times \{0, 1\}$ such that the error of the hypothesis returned
by $A$ is $\Omega\left(\sqrt{\frac{d}{m}}\right)$ with some constant probability. The VC-dimension appears as a
critical quantity in learning in this general setting as well. In particular, with an
infinite VC-dimension, agnostic PAC-learning is not possible.


3.5 Chapter notes 


The use of Rademacher complexity for deriving generalization bounds in learning 
was first advocated by Koltchinskii [2001], Koltchinskii and Panchenko [2000], and 
Bartlett, Boucheron, and Lugosi [2002a], see also [Koltchinskii and Panchenko, 
2002, Bartlett and Mendelson, 2002]. Bartlett, Bousquet, and Mendelson [2002b] 
introduced the notion of local Rademacher complexity, that is the Rademacher 
complexity restricted to a subset of the hypothesis set limited by a bound on 
the variance. This can be used to derive better guarantees under some regularity 
assumptions about the noise. 

Theorem 3.3 is due to Massart [2000]. The notion of VC-dimension was introduced
by Vapnik and Chervonenkis [1971] and has been since extensively studied [Vapnik,
2006, Vapnik and Chervonenkis, 1974, Blumer et al., 1989, Assouad, 1983, Dudley,
1999]. In addition to the key role it plays in machine learning, the VC-dimension is
also widely used in a variety of other areas of computer science and mathematics
(e.g., see Shelah [1972], Chazelle [2000]). Theorem 3.5 is known as Sauer’s lemma
in the learning community; however, the result was first given by Vapnik and
Chervonenkis [1971] (in a somewhat different version) and later independently by
Sauer [1972] and Shelah [1972].

In the realizable case, lower bounds for the expected error in terms of the
VC-dimension were given by Vapnik and Chervonenkis [1974] and Haussler et al. [1988].
Later, a lower bound for the probability of error such as that of theorem 3.6 was
given by Blumer et al. [1989]. Theorem 3.6 and its proof, which improves upon
this previous result, are due to Ehrenfeucht, Haussler, Kearns, and Valiant [1988].
Devroye and Lugosi [1995] gave slightly tighter bounds for the same problem with 
a more complex expression. Theorem 3.7 giving a lower bound in the non-realizable 
case and the proof presented are due to Anthony and Bartlett [1999]. For other 
examples of application of the probabilistic method demonstrating its full power, 
consult the reference book of Alon and Spencer [1992]. 

There are several other measures of the complexity of a family of functions used
in machine learning, including covering numbers, packing numbers, and some other
complexity measures discussed in chapter 10. A covering number $\mathcal{N}_p(G, \epsilon)$ is the
minimal number of $L_p$ balls of radius $\epsilon > 0$ needed to cover a family of loss functions
$G$. A packing number $\mathcal{M}_p(G, \epsilon)$ is the maximum number of non-overlapping $L_p$
balls of radius $\epsilon$ centered in $G$. The two notions are closely related, in particular
it can be shown straightforwardly that $\mathcal{M}_p(G, 2\epsilon) \le \mathcal{N}_p(G, \epsilon) \le \mathcal{M}_p(G, \epsilon)$ for $G$
and $\epsilon > 0$. Each complexity measure naturally induces a different reduction of
infinite hypothesis sets to finite ones, thereby resulting in generalization bounds
for infinite hypothesis sets. Exercise 3.22 illustrates the use of covering numbers
for deriving generalization bounds using a very simple proof. There are also close
relationships between these complexity measures: for example, by Dudley’s theorem,
the empirical Rademacher complexity can be bounded in terms of $\mathcal{N}_2(G, \epsilon)$ [Dudley,
1967, 1987] and the covering and packing numbers can be bounded in terms of the
VC-dimension [Haussler, 1995]. See also [Ledoux and Talagrand, 1991, Alon et al.,
1997, Anthony and Bartlett, 1999, Cucker and Smale, 2001, Vidyasagar, 1997] for
a number of upper bounds on the covering number in terms of other complexity
measures.


3.6 Exercises 


3.1 Growth function of intervals in $\mathbb{R}$. Let $H$ be the set of intervals in $\mathbb{R}$. The
VC-dimension of $H$ is 2. Compute its shattering coefficient $\Pi_H(m)$, $m \ge 0$. Compare
your result with the general bound for growth functions.


3.2 Lower bound on growth function. Prove that Sauer’s lemma (theorem 3.5) is
tight, i.e., for any set $X$ of $m > d$ elements, show that there exists a hypothesis
class $H$ of VC-dimension $d$ such that $\Pi_H(m) = \sum_{i=0}^{d} \binom{m}{i}$.


3.3 Singleton hypothesis class. Consider the trivial hypothesis set H = {ho}. 


(a) Show that $\mathfrak{R}_m(H) = 0$ for any $m > 0$.


(b) Use a similar construction to show that Massart’s lemma (theorem 3.3) is 
tight. 


3.4 Rademacher identities. Fix $m \ge 1$. Prove the following identities for any $\alpha \in \mathbb{R}$
and any two hypothesis sets $H$ and $H'$ of functions mapping from $\mathcal{X}$ to $\mathbb{R}$:


(a) $\mathfrak{R}_m(\alpha H) = |\alpha| \mathfrak{R}_m(H)$.

(b) $\mathfrak{R}_m(H + H') = \mathfrak{R}_m(H) + \mathfrak{R}_m(H')$.

(c) $\mathfrak{R}_m(\{\max(h, h') : h \in H, h' \in H'\})$,

where $\max(h, h')$ denotes the function $x \mapsto \max(h(x), h'(x))$ (Hint: you
could use the identity $\max(a, b) = \frac{1}{2}[a + b + |a - b|]$ valid for all $a, b \in \mathbb{R}$ and
Talagrand’s contraction lemma (see lemma 4.2)).


3.5 Rademacher complexity. Professor Jesetoo claims to have found a better bound
on the Rademacher complexity of any hypothesis set $H$ of functions taking values
in $\{-1, +1\}$, in terms of its VC-dimension VCdim($H$). His bound is of the form
$\mathfrak{R}_m(H) \le O\left(\frac{\mathrm{VCdim}(H)}{m}\right)$. Can you show that Professor Jesetoo’s claim cannot be
correct? (Hint: consider a hypothesis set $H$ reduced to just two simple functions.)


3.6 VC-dimension of union of k intervals. What is the VC-dimension of subsets of 
the real line formed by the union of k intervals? 


3.7 VC-dimension of finite hypothesis sets. Show that the VC-dimension of a finite
hypothesis set $H$ is at most $\log_2 |H|$.


3.8 VC-dimension of subsets. What is the VC-dimension of the set of subsets $I_\alpha$ of
the real line parameterized by a single parameter $\alpha$: $I_\alpha = [\alpha, \alpha + 1] \cup [\alpha + 2, +\infty)$?


3.9 VC-dimension of closed balls in $\mathbb{R}^n$. Show that the VC-dimension of the set
of all closed balls in $\mathbb{R}^n$, i.e., sets of the form $\{x \in \mathbb{R}^n : \|x - x_0\|^2 \le r\}$ for some
$x_0 \in \mathbb{R}^n$ and $r \ge 0$, is less than or equal to $n + 2$.

3.10 VC-dimension of ellipsoids. What is the VC-dimension of the set of all ellipsoids
in $\mathbb{R}^n$?


3.11 VC-dimension of a vector space of real functions. Let $F$ be a finite-dimensional
vector space of real functions on $\mathbb{R}^n$, $\dim(F) = r < \infty$. Let $H$ be the set of
hypotheses:

$$H = \{\{x : f(x) \ge 0\} : f \in F\}.$$

Show that $d$, the VC-dimension of $H$, is finite and that $d \le r$. (Hint: select an
arbitrary set of $m = r + 1$ points and consider the linear mapping $u\colon F \to \mathbb{R}^m$ defined
by: $u(f) = (f(x_1), \ldots, f(x_m))$.)


3.12 VC-dimension of sine functions. Consider the hypothesis family of sine
functions (Example 3.5): $\{x \mapsto \sin(\omega x) : \omega \in \mathbb{R}\}$.


(a) Show that for any $x \in \mathbb{R}$ the points $x$, $2x$, $3x$ and $4x$ cannot be shattered
by this family of sine functions.

(b) Show that the VC-dimension of the family of sine functions is infinite.
(Hint: show that $\{2^{-m} : m \in \mathbb{N}\}$ can be fully shattered for any $m > 0$.)


3.13 VC-dimension of union of halfspaces. Determine the VC-dimension of the 
subsets of the real line formed by the union of k intervals. 


3.14 VC-dimension of intersection of halfspaces. Consider the class $C_k$ of convex
intersections of $k$ halfspaces. Give lower and upper bound estimates for VCdim($C_k$).


3.15 VC-dimension of intersection concepts.


(a) Let $C_1$ and $C_2$ be two concept classes. Show that for any concept class
$C = \{c_1 \cap c_2 : c_1 \in C_1, c_2 \in C_2\}$,

$$\Pi_C(m) \le \Pi_{C_1}(m)\, \Pi_{C_2}(m). \tag{3.55}$$

(b) Let $C$ be a concept class with VC-dimension $d$ and let $C_s$ be the concept
class formed by all intersections of $s$ concepts from $C$, $s \ge 1$. Show that the
VC-dimension of $C_s$ is bounded by $2ds \log_2(3s)$. (Hint: show that $\log_2(3x) <
9x/(2e)$ for any $x \ge 2$.)


3.16 VC-dimension of union of concepts. Let $A$ and $B$ be two sets of functions
mapping from $\mathcal{X}$ into $\{0, 1\}$, and assume that both $A$ and $B$ have finite
VC-dimension, with VCdim($A$) $= d_A$ and VCdim($B$) $= d_B$. Let $C = A \cup B$ be the
union of $A$ and $B$.


(a) Prove that for all $m$, $\Pi_C(m) \le \Pi_A(m) + \Pi_B(m)$.
(b) Use Sauer’s lemma to show that for $m \ge d_A + d_B + 2$, $\Pi_C(m) < 2^m$, and
give a bound on the VC-dimension of $C$.


3.17 VC-dimension of symmetric difference of concepts. For two sets $A$ and $B$, let
$A \Delta B$ denote the symmetric difference of $A$ and $B$, i.e., $A \Delta B = (A \cup B) - (A \cap B)$.
Let $H$ be a non-empty family of subsets of $\mathcal{X}$ with finite VC-dimension. Let $A$ be
an element of $H$ and define $H \Delta A = \{X \Delta A : X \in H\}$. Show that

$$\mathrm{VCdim}(H \Delta A) = \mathrm{VCdim}(H).$$


3.18 Symmetric functions. A function $h\colon \{0, 1\}^n \to \{0, 1\}$ is symmetric if its value
is uniquely determined by the number of 1’s in the input. Let $C$ denote the set of
all symmetric functions.


(a) Determine the VC-dimension of $C$.


(b) Give lower and upper bounds on the sample complexity of any consistent
PAC learning algorithm for $C$.


(c) Note that any hypothesis $h \in C$ can be represented by a vector $(y_0, y_1, \ldots, y_n) \in
\{0, 1\}^{n+1}$, where $y_i$ is the value of $h$ on examples having precisely $i$ 1’s. Devise
a consistent learning algorithm for $C$ based on this representation.


3.19 Biased coins. Professor Moent has two coins in his pocket, coin $x_A$ and coin
$x_B$. Both coins are slightly biased, i.e., $\Pr[x_A = 0] = 1/2 - \epsilon/2$ and $\Pr[x_B = 0] =
1/2 + \epsilon/2$, where $0 < \epsilon < 1$ is a small positive number, 0 denotes heads and 1
denotes tails. He likes to play the following game with his students. He picks a coin
$x \in \{x_A, x_B\}$ from his pocket uniformly at random, tosses it $m$ times, reveals the
sequence of 0s and 1s he obtained and asks which coin was tossed. Determine how
large $m$ needs to be for a student’s coin prediction error to be at most $\delta > 0$.


(a) Let $S$ be a sample of size $m$. Professor Moent’s best student, Oskar, plays
according to the decision rule $f_o\colon \{0, 1\}^m \to \{x_A, x_B\}$ defined by $f_o(S) = x_A$
iff $N(S) < m/2$, where $N(S)$ is the number of 0’s in sample $S$.

Suppose $m$ is even, then show that

$$\mathrm{error}(f_o) \ge \frac{1}{2} \Pr\left[N(S) \ge \frac{m}{2} \,\middle|\, x = x_A\right]. \tag{3.56}$$

(b) Assuming $m$ even, use the inequalities given in the appendix (section D.3)
to show that

$$\mathrm{error}(f_o) > \frac{1}{4}\left[1 - \left[1 - e^{-\frac{m\epsilon^2}{1 - \epsilon^2}}\right]^{\frac{1}{2}}\right]. \tag{3.57}$$

(c) Argue that if $m$ is odd, the probability can be lower bounded by using
$m + 1$ in the bound in (a) and conclude that for both odd and even $m$,

$$\mathrm{error}(f_o) > \frac{1}{4}\left[1 - \left[1 - e^{-\frac{2\lceil m/2 \rceil \epsilon^2}{1 - \epsilon^2}}\right]^{\frac{1}{2}}\right]. \tag{3.58}$$

(d) Using this bound, how large must $m$ be if Oskar’s error is at most $\delta$, where
$0 < \delta < 1/4$? What is the asymptotic behavior of this lower bound as a function
of $\epsilon$?

(e) Show that no decision rule $f\colon \{0, 1\}^m \to \{x_A, x_B\}$ can do better than
Oskar’s rule $f_o$. Conclude that the lower bound of the previous question applies
to all rules.


3.20 Infinite VC-dimension. 


(a) Show that if a concept class $C$ has infinite VC-dimension, then it is not
PAC-learnable.


(b) In the standard PAC-learning scenario, the learning algorithm receives all 
examples first and then computes its hypothesis. Within that setting, PAC- 
learning of concept classes with infinite VC-dimension is not possible as seen 
in the previous question. 

Imagine now a different scenario where the learning algorithm can alternate 
between drawing more examples and computation. The objective of this prob- 
lem is to prove that PAC-learning can then be possible for some concept classes 
with infinite VC-dimension. 

Consider for example the special case of the concept class $C$ of all subsets of
natural numbers. Professor Vitres has an idea for the first stage of a learning
algorithm $L$ PAC-learning $C$. In the first stage, $L$ draws a sufficient number of
points $m$ such that the probability of drawing a point beyond the maximum
value $M$ observed is small with high confidence. Can you complete Professor
Vitres’ idea by describing the second stage of the algorithm so that it
PAC-learns $C$? The description should be augmented with the proof that $L$ can
PAC-learn $C$.


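3.21 VC-dimension generalization bound, realizable case. In this exercise we show
that the bound given in corollary 3.4 can be improved to $O\left(\frac{d \log(m/d)}{m}\right)$ in the
realizable setting. Assume we are in the realizable scenario, i.e. the target concept is
included in our hypothesis class $H$. We will show that if a hypothesis $h$ is consistent
with a sample $S \sim D^m$ then for any $\epsilon > 0$ such that $m\epsilon \ge 8$

$$\Pr[R(h) > \epsilon] \le 2\left[\frac{2em}{d}\right]^d 2^{-m\epsilon/2}. \tag{3.59}$$

(a) Let $H_S \subseteq H$ be the subset of hypotheses consistent with the sample $S$,
let $\widehat{R}_S(h)$ denote the empirical error with respect to the sample $S$ and define
$S'$ as another independent sample drawn from $D^m$. Show that the following
inequality holds for any $h_0 \in H_S$:

$$\Pr\left[\sup_{h \in H_S} |\widehat{R}_S(h) - \widehat{R}_{S'}(h)| > \frac{\epsilon}{2}\right] \ge \Pr\left[B[m, \epsilon] > \frac{m\epsilon}{2}\right] \Pr[R(h_0) > \epsilon],$$

where $B[m, \epsilon]$ is a binomial random variable with parameters $[m, \epsilon]$. (Hint:
prove and use the fact that $\Pr[\widehat{R}(h) \ge \frac{\epsilon}{2}] \ge \Pr[\widehat{R}(h) > \frac{\epsilon}{2} \wedge R(h) > \epsilon]$.)

(b) Prove that $\Pr\left[B(m, \epsilon) > \frac{m\epsilon}{2}\right] \ge \frac{1}{2}$. Use this inequality along with the
result from (a) to show that for any $h_0 \in H_S$

$$\Pr[R(h_0) > \epsilon] \le 2 \Pr\left[\sup_{h \in H_S} |\widehat{R}_S(h) - \widehat{R}_{S'}(h)| > \frac{\epsilon}{2}\right].$$

(c) Instead of drawing two samples, we can draw one sample $T$ of size $2m$ then
uniformly at random split it into $S$ and $S'$. The right hand side of part (b) can
then be rewritten as:

$$\Pr\left[\sup_{h \in H_S} |\widehat{R}_S(h) - \widehat{R}_{S'}(h)| > \frac{\epsilon}{2}\right] = \Pr_{\substack{T \sim D^{2m}: \\ T \to (S, S')}}\left[\exists h \in H : \widehat{R}_S(h) = 0 \wedge \widehat{R}_{S'}(h) > \frac{\epsilon}{2}\right].$$

Let $h_0$ be a hypothesis such that $\widehat{R}_T(h_0) > \frac{\epsilon}{2}$ and let $l > \frac{m\epsilon}{2}$ be the total
number of errors $h_0$ makes on $T$. Show that the probability of all $l$ errors
falling into $S'$ is upper bounded by $2^{-l}$.

(d) Part (c) implies that for any $h \in H$

$$\Pr_{\substack{T \sim D^{2m}: \\ T \to (S, S')}}\left[\widehat{R}_S(h) = 0 \wedge \widehat{R}_{S'}(h) > \frac{\epsilon}{2} \,\middle|\, \widehat{R}_T(h_0) > \frac{\epsilon}{2}\right] \le 2^{-l}.$$

Use this bound to show that for any $h \in H$

$$\Pr_{\substack{T \sim D^{2m}: \\ T \to (S, S')}}\left[\widehat{R}_S(h) = 0 \wedge \widehat{R}_{S'}(h) > \frac{\epsilon}{2}\right] \le 2^{-\frac{\epsilon m}{2}}.$$

(e) Complete the proof of inequality (3.59) by using the union bound to upper
bound $\Pr_{T \sim D^{2m}: T \to (S, S')}\left[\exists h \in H : \widehat{R}_S(h) = 0 \wedge \widehat{R}_{S'}(h) > \frac{\epsilon}{2}\right]$. Show that we can achieve
a high probability generalization bound that is of the order $O\left(\frac{d \log(m/d)}{m}\right)$.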
3.22 Generalization bound based on covering numbers. Let $H$ be a family of
functions mapping $\mathcal{X}$ to a subset of real numbers $\mathcal{Y} \subseteq \mathbb{R}$. For any $\epsilon > 0$, the
covering number $\mathcal{N}(H, \epsilon)$ of $H$ for the $L_\infty$ norm is the minimal $k \in \mathbb{N}$ such that $H$
can be covered with $k$ balls of radius $\epsilon$, that is, there exists $\{h_1, \ldots, h_k\} \subseteq H$ such
that, for all $h \in H$, there exists $i \le k$ with $\|h - h_i\|_\infty = \max_{x \in \mathcal{X}} |h(x) - h_i(x)| \le \epsilon$.
In particular, when $H$ is a compact set, a finite covering can be extracted from a
covering of $H$ with balls of radius $\epsilon$ and thus $\mathcal{N}(H, \epsilon)$ is finite.


Covering numbers provide a measure of the complexity of a class of functions: the
larger the covering number, the richer is the family of functions. The objective of
this problem is to illustrate this by proving a learning bound in the case of the
squared loss. Let $D$ denote a distribution over $\mathcal{X} \times \mathcal{Y}$ according to which labeled
examples are drawn. Then, the generalization error of $h \in H$ for the squared loss is
defined by $R(h) = \mathrm{E}_{(x, y) \sim D}[(h(x) - y)^2]$ and its empirical error for a labeled sample
$S = ((x_1, y_1), \ldots, (x_m, y_m))$ by $\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2$. We will assume that $H$
is bounded, that is there exists $M > 0$ such that $|h(x) - y| \le M$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
The following is the generalization bound proven in this problem:

$$\Pr_{S \sim D^m}\left[\sup_{h \in H} |R(h) - \widehat{R}(h)| \ge \epsilon\right] \le \mathcal{N}\left(H, \frac{\epsilon}{8M}\right) 2 \exp\left(\frac{-m\epsilon^2}{2M^4}\right). \tag{3.60}$$

The proof is based on the following steps.
(a) Let $L_S(h) = R(h) - \widehat{R}(h)$, then show that for all $h_1, h_2 \in H$ and any labeled
sample $S$, the following inequality holds:

$$|L_S(h_1) - L_S(h_2)| \le 4M \|h_1 - h_2\|_\infty.$$

(b) Assume that $H$ can be covered by $k$ subsets $B_1, \ldots, B_k$, that is $H =
B_1 \cup \ldots \cup B_k$. Then, show that, for any $\epsilon > 0$, the following upper bound holds:

$$\Pr_{S \sim D^m}\left[\sup_{h \in H} |L_S(h)| \ge \epsilon\right] \le \sum_{i=1}^{k} \Pr_{S \sim D^m}\left[\sup_{h \in B_i} |L_S(h)| \ge \epsilon\right].$$

(c) Finally, let $k = \mathcal{N}(H, \frac{\epsilon}{8M})$ and let $B_1, \ldots, B_k$ be balls of radius $\epsilon/(8M)$
centered at $h_1, \ldots, h_k$ covering $H$. Use part (a) to show that for all $i \in [1, k]$,

$$\Pr_{S \sim D^m}\left[\sup_{h \in B_i} |L_S(h)| \ge \epsilon\right] \le \Pr_{S \sim D^m}\left[|L_S(h_i)| \ge \frac{\epsilon}{2}\right],$$

and apply Hoeffding’s inequality (theorem D.1) to prove (3.60).


4 Support Vector Machines 


This chapter presents one of the most theoretically well motivated and practically 
most effective classification algorithms in modern machine learning: Support Vector 
Machines (SVMs). We first introduce the algorithm for separable datasets, then 
present its general version designed for non-separable datasets, and finally provide 
a theoretical foundation for SVMs based on the notion of margin. We start with 
the description of the problem of linear classification. 


4.1 Linear classification 


Consider an input space $\mathcal{X}$ that is a subset of $\mathbb{R}^N$ with $N \ge 1$, and the output
or target space $\mathcal{Y} = \{-1, +1\}$, and let $f\colon \mathcal{X} \to \mathcal{Y}$ be the target function. Given
a hypothesis set $H$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the binary classification task is
formulated as follows. The learner receives a training sample $S$ of size $m$ drawn i.i.d.
from $\mathcal{X}$ according to some unknown distribution $D$, $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in
(\mathcal{X} \times \mathcal{Y})^m$, with $y_i = f(x_i)$ for all $i \in [1, m]$. The problem consists of determining a
hypothesis $h \in H$, a binary classifier, with small generalization error:

$$R_D(h) = \Pr_{x \sim D}[h(x) \ne f(x)]. \tag{4.1}$$

Different hypothesis sets $H$ can be selected for this task. In view of the results
presented in the previous section, which formalized Occam’s razor principle,
hypothesis sets with smaller complexity (e.g., smaller VC-dimension or Rademacher
complexity) provide better learning guarantees, everything else being equal. A
natural hypothesis set with relatively small complexity is that of linear classifiers,
or hyperplanes, which can be defined as follows:

$$H = \{x \mapsto \mathrm{sign}(w \cdot x + b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}. \tag{4.2}$$

A hypothesis of the form $x \mapsto \mathrm{sign}(w \cdot x + b)$ thus labels positively all points falling
on one side of the hyperplane $w \cdot x + b = 0$ and negatively all others. The problem
is referred to as a linear classification problem.
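A hypothesis of this form is straightforward to write down in code. The following Python sketch builds the function $x \mapsto \mathrm{sign}(w \cdot x + b)$ for a given weight vector and offset; the function name and the convention of assigning $-1$ to points lying exactly on the hyperplane are assumptions made for this illustration.

```python
def linear_classifier(w, b):
    """Return the hypothesis x -> sign(w . x + b) with values in {-1, +1}.

    Points on the hyperplane itself are assigned -1 here; that tie-breaking
    rule is an arbitrary choice made for this sketch.
    """
    def h(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if score > 0 else -1
    return h
```

For example, `linear_classifier([1.0, -1.0], 0.5)` labels the point `[2, 0]` positively and the point `[0, 2]` negatively.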


Figure 4.1 Two possible separating hyperplanes. The right-hand side figure shows 
a hyperplane that maximizes the margin. 


4.2 SVMs — separable case 


In this section, we assume that the training sample S can be linearly separated, 
that is, we assume the existence of a hyperplane that perfectly separates the 
training sample into two populations of positively and negatively labeled points, 
as illustrated by the left panel of figure 4.1. But there are then infinitely many 
such separating hyperplanes. Which hyperplane should a learning algorithm select? 
The solution returned by the SVM algorithm is the hyperplane with the maximum 
margin, or distance to the closest points, and is thus known as the mazimum-margin 
hyperplane. The right panel of figure 4.1 illustrates that choice. 

We will present later in this chapter a margin theory that provides a strong 
justification for this solution. We can observe already, however, that the SVM 
solution can also be viewed as the “safest” choice in the following sense: a test
point is classified correctly by a separating hyperplane with margin $\rho$ even when
it falls within a distance $\rho$ of the training samples sharing the same label; for the
SVM solution, $\rho$ is the maximum margin and thus the “safest” value.


4.2.1 Primal optimization problem 


We now derive the equations and optimization problem that define the SVM
solution. The general equation of a hyperplane in $\mathbb{R}^N$ is

$$w \cdot x + b = 0, \tag{4.3}$$

where $w \in \mathbb{R}^N$ is a non-zero vector normal to the hyperplane and $b \in \mathbb{R}$ a
scalar. Note that this definition of a hyperplane is invariant to non-zero scalar
multiplication. Hence, for a hyperplane that does not pass through any sample
point, we can scale $w$ and $b$ appropriately such that $\min_{(x,y) \in S} |w \cdot x + b| = 1$.

Figure 4.2 Margin and equations of the hyperplanes for a canonical maximum-
margin hyperplane. The marginal hyperplanes are represented by dashed lines on 
the figure. 


We define this representation of the hyperplane, i.e., the corresponding pair $(w, b)$,
as the canonical hyperplane. The distance of any point $x_0 \in \mathbb{R}^N$ to a hyperplane
defined by (4.3) is given by

$$\frac{|w \cdot x_0 + b|}{\|w\|}. \tag{4.4}$$

Thus, for a canonical hyperplane, the margin $\rho$ is given by

$$\rho = \min_{(x,y) \in S} \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|}. \tag{4.5}$$
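The rescaling to canonical form and the margin formula (4.5) can be checked numerically. The following is a minimal sketch in plain Python; the function names and the four sample points are illustrative assumptions, not part of the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def canonical_form(w, b, points):
    """Rescale (w, b) so that min over the sample of |w.x + b| equals 1."""
    scale = min(abs(dot(w, x) + b) for x in points)
    if scale == 0:
        raise ValueError("hyperplane passes through a sample point")
    return [wi / scale for wi in w], b / scale

def margin(w, b, points):
    """Distance from the hyperplane w.x + b = 0 to the closest sample point, as in (4.4)."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, x) + b) for x in points) / norm

# Hypothetical sample and a hyperplane x1 + x2 = 2 written at an arbitrary scale.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
w, b = [3.0, 3.0], -6.0
w_c, b_c = canonical_form(w, b, points)
rho = margin(w, b, points)
```

For this sample the canonical pair is $w = (0.5, 0.5)$, $b = -1$, and the margin computed from the unscaled pair agrees with $1/\|w\|$ for the canonical one, as (4.5) states.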


Figure 4.2 illustrates the margin for a maximum-margin hyperplane with a canonical
representation $(w, b)$. It also shows the marginal hyperplanes, which are the
hyperplanes parallel to the separating hyperplane and passing through the closest
points on the negative or positive sides. Since they are parallel to the separating
hyperplane, they admit the same normal vector $w$. Furthermore, by definition of a
canonical representation, for a point $x$ on a marginal hyperplane, $|w \cdot x + b| = 1$,
and thus the equations of the marginal hyperplanes are $w \cdot x + b = \pm 1$.

A hyperplane defined by $(w, b)$ correctly classifies a training point $x_i$, $i \in [1, m]$,
when $w \cdot x_i + b$ has the same sign as $y_i$. For a canonical hyperplane, by definition,
we have $|w \cdot x_i + b| \geq 1$ for all $i \in [1, m]$; thus, $x_i$ is correctly classified when
$y_i(w \cdot x_i + b) \geq 1$. In view of (4.5), maximizing the margin of a canonical hyperplane
is equivalent to minimizing $\|w\|$ or $\frac{1}{2}\|w\|^2$. Thus, in the separable case, the SVM
solution, which is a hyperplane maximizing the margin while correctly classifying all
training points, can be expressed as the solution to the following convex optimization
problem:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \tag{4.6}$$
$$\text{subject to: } y_i(w \cdot x_i + b) \geq 1, \quad \forall i \in [1, m].$$


The objective function $F\colon w \mapsto \frac{1}{2}\|w\|^2$ is infinitely differentiable. Its gradient is
$\nabla F(w) = w$ and its Hessian the identity matrix $\nabla^2 F(w) = I$, whose eigenvalues are
strictly positive. Therefore, $\nabla^2 F(w) \succ 0$ and $F$ is strictly convex. The constraints
are all defined by affine functions $g_i\colon (w, b) \mapsto 1 - y_i(w \cdot x_i + b)$ and are thus qualified.
Thus, in view of the results known for convex optimization (see appendix B for
details), the optimization problem of (4.6) admits a unique solution, an important
and favorable property that does not hold for all learning algorithms.

Moreover, since the objective function is quadratic and the constraints affine, the 
optimization problem of (4.6) is in fact a specific instance of quadratic program- 
ming (QP), a family of problems extensively studied in optimization. A variety of 
commercial and open-source solvers are available for solving convex QP problems. 
Additionally, motivated by the empirical success of SVMs along with their rich
theoretical underpinnings, specialized methods have been developed to solve this
particular convex QP problem more efficiently, notably block coordinate descent
algorithms with blocks of just two coordinates.


4.2.2 Support vectors 


The constraints are affine and thus qualified. The objective function as well as the 
affine constraints are convex and differentiable. Thus, the hypotheses of theorem B.8 
hold and the KKT conditions apply at the optimum. We shall use these conditions 
to both analyze the algorithm and demonstrate several of its crucial properties, 
and subsequently derive the dual optimization problem associated to SVMs in 
section 4.2.3. 
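
We introduce Lagrange variables $\alpha_i \geq 0$, $i \in [1, m]$, associated to the $m$
constraints and denote by $\alpha$ the vector $(\alpha_1, \ldots, \alpha_m)^\top$. The Lagrangian can then be
defined for all $w \in \mathbb{R}^N$, $b \in \mathbb{R}$, and $\alpha \in \mathbb{R}_+^m$, by

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i [y_i(w \cdot x_i + b) - 1]. \tag{4.7}$$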


The KKT conditions are obtained by setting the gradient of the Lagrangian with
respect to the primal variables $w$ and $b$ to zero and by writing the complementarity
conditions:

$$\nabla_w L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \implies w = \sum_{i=1}^m \alpha_i y_i x_i \tag{4.8}$$
$$\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \implies \sum_{i=1}^m \alpha_i y_i = 0 \tag{4.9}$$
$$\forall i, \; \alpha_i [y_i(w \cdot x_i + b) - 1] = 0 \implies \alpha_i = 0 \vee y_i(w \cdot x_i + b) = 1. \tag{4.10}$$

By equation (4.8), the weight vector $w$ solution of the SVM problem is a linear
combination of the training set vectors $x_1, \ldots, x_m$. A vector $x_i$ appears in that expansion
iff $\alpha_i \neq 0$. Such vectors are called support vectors. By the complementarity
conditions (4.10), if $\alpha_i \neq 0$, then $y_i(w \cdot x_i + b) = 1$. Thus, support vectors lie on the
marginal hyperplanes $w \cdot x_i + b = \pm 1$.

Support vectors fully define the maximum-margin hyperplane or SVM solution, 
which justifies the name of the algorithm. By definition, vectors not lying on the 
marginal hyperplanes do not affect the definition of these hyperplanes — in their 
absence, the solution to the SVM problem remains unchanged. Note that while the 
solution w of the SVM problem is unique, the support vectors are not. In dimension 
N, N+1 points are sufficient to define a hyperplane. Thus, when more than N + 1 
points lie on a marginal hyperplane, different choices are possible for the N + 1 
support vectors. 


4.2.3 Dual optimization problem 


To derive the dual form of the constrained optimization problem (4.6), we plug 
into the Lagrangian the definition of w in terms of the dual variables as expressed 
in (4.8) and apply the constraint (4.9). This yields 


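$$L = \underbrace{\frac{1}{2}\Big\|\sum_{i=1}^m \alpha_i y_i x_i\Big\|^2 - \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)}_{-\frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} - \underbrace{\sum_{i=1}^m \alpha_i y_i b}_{0} + \sum_{i=1}^m \alpha_i, \tag{4.11}$$

which simplifies to

$$L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j). \tag{4.12}$$

This leads to the following dual optimization problem for SVMs in the separable
case:

$$\max_{\alpha} \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{4.13}$$
$$\text{subject to: } \alpha_i \geq 0 \wedge \sum_{i=1}^m \alpha_i y_i = 0, \quad \forall i \in [1, m].$$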


The objective function $G\colon \alpha \mapsto \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ is infinitely
differentiable. Its Hessian is given by $\nabla^2 G = -A$, with $A = (y_i x_i \cdot y_j x_j)_{ij}$. $A$ is
the Gram matrix associated to the vectors $y_1 x_1, \ldots, y_m x_m$ and is therefore positive
semidefinite, which shows that $\nabla^2 G \preceq 0$ and that $G$ is a concave function. Since
the constraints are affine and convex, the maximization problem (4.13) is equivalent
to a convex optimization problem. Since $G$ is a quadratic function of $\alpha$, this dual
optimization problem is also a QP problem, as in the case of the primal optimization,
and once again both general-purpose and specialized QP solvers can be used to
obtain the solution (see exercise 4.4 for details on the SMO algorithm, which is
often used to solve the dual form of the SVM problem in the more general
non-separable setting).

Moreover, since the constraints are affine, they are qualified and strong duality 
holds (see appendix B). Thus, the primal and dual problems are equivalent, i.e., 
the solution $\alpha$ of the dual problem (4.13) can be used directly to determine the
hypothesis returned by SVMs, using equation (4.8):

$$h(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + b\Big). \tag{4.14}$$

Since support vectors lie on the marginal hyperplanes, for any support vector $x_i$,
$w \cdot x_i + b = y_i$, and thus $b$ can be obtained via

$$b = y_i - \sum_{j=1}^m \alpha_j y_j (x_j \cdot x_i). \tag{4.15}$$


The dual optimization problem (4.13) and the expressions (4.14) and (4.15) reveal 
an important property of SVMs: the hypothesis solution depends only on inner 
products between vectors and not directly on the vectors themselves. 

Equation (4.15) can now be used to derive a simple expression of the margin $\rho$ in
terms of $\alpha$. Since (4.15) holds for all $i$ with $\alpha_i \neq 0$, multiplying both sides by $\alpha_i y_i$
and taking the sum leads to

$$\sum_{i=1}^m \alpha_i y_i b = \sum_{i=1}^m \alpha_i y_i^2 - \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j). \tag{4.16}$$

Using the fact that $y_i^2 = 1$ along with equation (4.8) then yields

$$0 = \sum_{i=1}^m \alpha_i - \|w\|^2. \tag{4.17}$$

Noting that $\alpha_i \geq 0$, we obtain the following expression of the margin $\rho$ in terms of
the $L_1$ norm of $\alpha$:

$$\rho^2 = \frac{1}{\|w\|_2^2} = \frac{1}{\sum_{i=1}^m \alpha_i} = \frac{1}{\|\alpha\|_1}. \tag{4.18}$$
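As a small worked check of (4.8), (4.15) and (4.18), consider a hypothetical two-point sample $x_1 = (1, 1)$, $y_1 = +1$ and $x_2 = (-1, -1)$, $y_2 = -1$. The constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, the dual objective becomes $2a - 4a^2$, and its maximum is at $a = 1/4$. The sketch below, in plain Python with illustrative names, recovers $(w, b)$ from that dual solution and compares the two expressions of the margin.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_from_dual(alpha, X, y):
    """Recover (w, b) from a dual solution using (4.8) and (4.15)."""
    dim = len(X[0])
    w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(dim)]
    # Any index with alpha_i > 0 is a support vector and can be used in (4.15).
    s = next(i for i, a in enumerate(alpha) if a > 0)
    b = y[s] - sum(a * yj * dot(xj, X[s]) for a, yj, xj in zip(alpha, y, X))
    return w, b

# Hypothetical two-point sample; alpha = (1/4, 1/4) was derived by hand above.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]

w, b = primal_from_dual(alpha, X, y)
rho_from_w = 1.0 / math.sqrt(dot(w, w))        # 1 / ||w||
rho_from_alpha = 1.0 / math.sqrt(sum(alpha))   # 1 / sqrt(||alpha||_1), from (4.18)
```

Both expressions give $\rho = \sqrt{2}$, which is also the distance from $(1, 1)$ to the line $x_1 + x_2 = 0$.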


4.2.4 Leave-one-out analysis 


We now use the notion of leave-one-out error to derive a first learning guarantee 
for SVMs based on the fraction of support vectors in the training set. 


Definition 4.1 Leave-one-out error 

Let $h_S$ denote the hypothesis returned by a learning algorithm $A$, when trained on
a fixed sample $S$. Then, the leave-one-out error of $A$ on a sample $S$ of size $m$ is
defined by

$$\widehat{R}_{\mathrm{LOO}}(A) = \frac{1}{m}\sum_{i=1}^m 1_{h_{S - \{x_i\}}(x_i) \neq y_i}.$$

Thus, for each $i \in [1, m]$, $A$ is trained on all the points in $S$ except for $x_i$, i.e.,
$S - \{x_i\}$, and its error is then computed using $x_i$. The leave-one-out error is the
average of these errors. We will use an important property of the leave-one-out error
stated in the following lemma.
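The definition translates directly into a loop that retrains on $S - \{x_i\}$ for each $i$. The sketch below is only an illustration: the nearest-class-mean learner on one-dimensional inputs is a hypothetical stand-in for the algorithm $A$, and the sample is made up.

```python
def leave_one_out_error(train, sample):
    """Average 0/1 error of train(S - {x_i}) evaluated on the held-out (x_i, y_i)."""
    errors = 0
    for i, (x_i, y_i) in enumerate(sample):
        rest = sample[:i] + sample[i + 1:]
        h = train(rest)
        errors += int(h(x_i) != y_i)
    return errors / len(sample)

def train_nearest_mean(sample):
    """Toy learner (an assumption for illustration): predict the class with the closer mean."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == -1]
    mu_pos = sum(pos) / len(pos)
    mu_neg = sum(neg) / len(neg)
    return lambda x: 1 if abs(x - mu_pos) <= abs(x - mu_neg) else -1

# Hypothetical 1-D sample; the last point sits on the "wrong" side.
sample = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1), (0.4, -1)]
loo = leave_one_out_error(train_nearest_mean, sample)
```

Here only the point $0.4$ is misclassified when held out, so the leave-one-out error is $1/5$. The loop makes the cost visible: the learner is trained $m$ times.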


Lemma 4.1 
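The average leave-one-out error for samples of size $m \geq 2$ is an unbiased estimate
of the average generalization error for samples of size $m - 1$:

$$\mathop{\mathbb{E}}_{S \sim D^m}[\widehat{R}_{\mathrm{LOO}}(A)] = \mathop{\mathbb{E}}_{S' \sim D^{m-1}}[R(h_{S'})], \tag{4.19}$$

where $D$ denotes the distribution according to which points are drawn.

Proof By the linearity of expectation, we can write

$$\begin{aligned}
\mathop{\mathbb{E}}_{S \sim D^m}[\widehat{R}_{\mathrm{LOO}}(A)] &= \frac{1}{m}\sum_{i=1}^m \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h_{S - \{x_i\}}(x_i) \neq y_i}\big] \\
&= \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h_{S - \{x_1\}}(x_1) \neq y_1}\big] \\
&= \mathop{\mathbb{E}}_{S' \sim D^{m-1},\, x_1 \sim D}\big[1_{h_{S'}(x_1) \neq y_1}\big] \\
&= \mathop{\mathbb{E}}_{S' \sim D^{m-1}}\Big[\mathop{\mathbb{E}}_{x_1 \sim D}\big[1_{h_{S'}(x_1) \neq y_1}\big]\Big] \\
&= \mathop{\mathbb{E}}_{S' \sim D^{m-1}}[R(h_{S'})].
\end{aligned}$$

For the second equality, we used the fact that, since the points of $S$ are drawn in an
i.i.d. fashion, the expectation $\mathbb{E}_{S \sim D^m}[1_{h_{S - \{x_i\}}(x_i) \neq y_i}]$ does not depend on the choice
of $i \in [1, m]$ and is thus equal to $\mathbb{E}_{S \sim D^m}[1_{h_{S - \{x_1\}}(x_1) \neq y_1}]$.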


In general, computing the leave-one-out error may be costly since it requires training
$m$ times on samples of size $m - 1$. In some situations however, it is possible to derive
the expression of $\widehat{R}_{\mathrm{LOO}}(A)$ much more efficiently (see exercise 10.9).


Theorem 4.1 
Let $h_S$ be the hypothesis returned by SVMs for a sample $S$, and let $N_{SV}(S)$ be the
number of support vectors that define $h_S$. Then,

$$\mathop{\mathbb{E}}_{S \sim D^m}[R(h_S)] \leq \mathop{\mathbb{E}}_{S \sim D^{m+1}}\Big[\frac{N_{SV}(S)}{m + 1}\Big].$$

Proof Let $S$ be a linearly separable sample of size $m + 1$. If $x$ is not a support vector
for $h_S$, removing it does not change the SVM solution. Thus, $h_{S - \{x\}} = h_S$ and
$h_{S - \{x\}}$ correctly classifies $x$. By contraposition, if $h_{S - \{x\}}$ misclassifies $x$, $x$ must be
a support vector, which implies

$$\widehat{R}_{\mathrm{LOO}}(\mathrm{SVM}) \leq \frac{N_{SV}(S)}{m + 1}. \tag{4.20}$$

Taking the expectation of both sides and using lemma 4.1 yields the result.


Theorem 4.1 gives a sparsity argument in favor of SVMs: the average error of 
the algorithm is upper bounded by the average fraction of support vectors. One 
may hope that for many distributions seen in practice, a relatively small number 
of the training points will lie on the marginal hyperplanes. The solution will then 
be sparse in the sense that a small fraction of the dual variables $\alpha_i$ will be
non-zero. Note, however, that this bound is relatively weak since it applies only to the
average generalization error of the algorithm over all samples of size $m$. It provides
no information about the variance of the generalization error. In section 4.4, we
present stronger high-probability bounds using a different argument based on the
notion of margin.

Figure 4.3 A separating hyperplane with point $x_i$ classified incorrectly and point
$x_j$ correctly classified, but with margin less than 1.


4.3 SVMs — non-separable case 


In most practical settings, the training data is not linearly separable, i.e., for any
hyperplane $w \cdot x + b = 0$, there exists $x_i \in S$ such that

$$y_i[w \cdot x_i + b] \not\geq 1. \tag{4.21}$$

Thus, the constraints imposed in the linearly separable case discussed in section 4.2
cannot all hold simultaneously. However, a relaxed version of these constraints can
indeed hold, that is, for each $i \in [1, m]$, there exist $\xi_i \geq 0$ such that

$$y_i[w \cdot x_i + b] \geq 1 - \xi_i. \tag{4.22}$$

The variables $\xi_i$ are known as slack variables and are commonly used in optimization
to define relaxed versions of some constraints. Here, a slack variable $\xi_i$ measures
the distance by which vector $x_i$ violates the desired inequality, $y_i(w \cdot x_i + b) \geq 1$.
Figure 4.3 illustrates the situation. For a hyperplane $w \cdot x + b = 0$, a vector $x_i$
with $\xi_i > 0$ can be viewed as an outlier. Each $x_i$ must be positioned on the correct
side of the appropriate marginal hyperplane to not be considered an outlier. As a
consequence, a vector $x_i$ with $0 < y_i(w \cdot x_i + b) < 1$ is correctly classified by the
hyperplane $w \cdot x + b = 0$ but is nonetheless considered to be an outlier, that is, $\xi_i > 0$.
If we omit the outliers, the training data is correctly separated by $w \cdot x + b = 0$
with a margin $\rho = 1/\|w\|$ that we refer to as the soft margin, as opposed to the
hard margin in the separable case.

How should we select the hyperplane in the non-separable case? One idea consists
of selecting the hyperplane that minimizes the empirical error. But that solution
will not benefit from the large-margin guarantees we will present in section 4.4.
Furthermore, the problem of determining a hyperplane with the smallest zero-one
loss, that is the smallest number of misclassifications, is NP-hard as a function of
the dimension $N$ of the space.

Figure 4.4 Both the hinge loss and the quadratic hinge loss provide convex upper
bounds on the binary zero-one loss.

Here, there are two conflicting objectives: on one hand, we wish to limit the
total amount of slack due to outliers, which can be measured by $\sum_{i=1}^m \xi_i$, or, more
generally by $\sum_{i=1}^m \xi_i^p$ for some $p \geq 1$; on the other hand, we seek a hyperplane with
a large margin, though a larger margin can lead to more outliers and thus larger
amounts of slack.


4.3.1 Primal optimization problem 


This leads to the following general optimization problem defining SVMs in the
non-separable case where the parameter $C \geq 0$ determines the trade-off between
margin-maximization (or minimization of $\|w\|^2$) and the minimization of the slack
penalty $\sum_{i=1}^m \xi_i^p$:

$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i^p \tag{4.23}$$
$$\text{subject to } y_i(w \cdot x_i + b) \geq 1 - \xi_i \wedge \xi_i \geq 0, \quad i \in [1, m],$$

where $\xi = (\xi_1, \ldots, \xi_m)^\top$. The parameter $C$ is typically determined via $n$-fold
cross-validation (see section 1.3).
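For a fixed pair $(w, b)$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$, so the objective of (4.23) can be evaluated directly. The following sketch assumes a made-up sample and hyperplane; the function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Smallest xi_i >= 0 satisfying y_i (w.x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C, p=1):
    """Value of (1/2)||w||^2 + C * sum_i xi_i^p at the minimal slacks for (w, b)."""
    xi = slacks(w, b, X, y)
    return 0.5 * dot(w, w) + C * sum(s ** p for s in xi)

# Hypothetical sample: the second point is inside the margin, the fourth is misclassified.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

xi = slacks(w, b, X, y)
obj_p1 = soft_margin_objective(w, b, X, y, C=1.0, p=1)
obj_p2 = soft_margin_objective(w, b, X, y, C=1.0, p=2)
```

The slacks are $(0, 0.5, 0, 2)$: a correctly classified point inside the margin gets a slack in $(0, 1)$, a misclassified point a slack greater than 1. With $p = 2$ the large slack dominates the penalty.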

As in the separable case, (4.23) is a convex optimization problem since the
constraints are affine and thus convex and since the objective function is convex
for any $p \geq 1$. In particular, $\xi \mapsto \sum_{i=1}^m |\xi_i|^p = \|\xi\|_p^p$ is convex in view of the convexity
of the norm $\|\cdot\|_p$.

There are many possible choices for $p$ leading to more or less aggressive
penalizations of the slack terms (see exercise 4.1). The choices $p = 1$ and $p = 2$ lead to
the most straightforward solutions and analyses. The loss functions associated with
$p = 1$ and $p = 2$ are called the hinge loss and the quadratic hinge loss, respectively.
Figure 4.4 shows the plots of these loss functions as well as that of the standard
zero-one loss function. Both hinge losses are convex upper bounds on the zero-one
loss, thus making them well suited for optimization. In what follows, the analysis is
presented in the case of the hinge loss ($p = 1$), which is the most widely used loss
function for SVMs.
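The upper-bound property can be checked pointwise. In the sketch below the losses are written as functions of $u = y\,h(x)$, a convention assumed here for illustration, with the zero-one loss counting $u \leq 0$ as an error.

```python
def zero_one_loss(u):
    """1 if the point is misclassified (u = y h(x) <= 0), else 0."""
    return 1.0 if u <= 0 else 0.0

def hinge_loss(u):
    return max(0.0, 1.0 - u)

def quadratic_hinge_loss(u):
    return max(0.0, 1.0 - u) ** 2

# Check both bounds on a grid of values of u in [-3, 3].
grid = [k / 10 for k in range(-30, 31)]
upper_bounds_hold = all(
    hinge_loss(u) >= zero_one_loss(u) and quadratic_hinge_loss(u) >= zero_one_loss(u)
    for u in grid
)
```

All three losses equal 1 at $u = 0$. For $u < 0$ the quadratic hinge loss grows faster than the hinge loss, which is one sense in which $p = 2$ penalizes outliers more aggressively.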


4.3.2 Support vectors 


As in the separable case, the constraints are affine and thus qualified. The objective 
function as well as the affine constraints are convex and differentiable. Thus, the 
hypotheses of theorem B.8 hold and the KKT conditions apply at the optimum. 
We use these conditions to both analyze the algorithm and demonstrate several 
of its crucial properties, and subsequently derive the dual optimization problem 
associated to SVMs in section 4.3.3. 

We introduce Lagrange variables $\alpha_i \geq 0$, $i \in [1, m]$, associated to the first $m$
constraints and $\beta_i \geq 0$, $i \in [1, m]$, associated to the non-negativity constraints of
the slack variables. We denote by $\alpha$ the vector $(\alpha_1, \ldots, \alpha_m)^\top$ and by $\beta$ the vector
$(\beta_1, \ldots, \beta_m)^\top$. The Lagrangian can then be defined for all $w \in \mathbb{R}^N$, $b \in \mathbb{R}$, and
$\xi, \alpha, \beta \in \mathbb{R}_+^m$, by

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i [y_i(w \cdot x_i + b) - 1 + \xi_i] - \sum_{i=1}^m \beta_i \xi_i. \tag{4.24}$$

The KKT conditions are obtained by setting the gradient of the Lagrangian
with respect to the primal variables $w$, $b$, and $\xi_i$s to zero and by writing the
complementarity conditions:

$$\nabla_w L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \implies w = \sum_{i=1}^m \alpha_i y_i x_i \tag{4.25}$$
$$\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \implies \sum_{i=1}^m \alpha_i y_i = 0 \tag{4.26}$$
$$\nabla_{\xi_i} L = C - \alpha_i - \beta_i = 0 \implies \alpha_i + \beta_i = C \tag{4.27}$$
$$\forall i, \; \alpha_i [y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \implies \alpha_i = 0 \vee y_i(w \cdot x_i + b) = 1 - \xi_i \tag{4.28}$$
$$\forall i, \; \beta_i \xi_i = 0 \implies \beta_i = 0 \vee \xi_i = 0. \tag{4.29}$$

By equation (4.25), as in the separable case, the weight vector $w$ solution of the
SVM problem is a linear combination of the training set vectors $x_1, \ldots, x_m$. A vector
$x_i$ appears in that expansion iff $\alpha_i \neq 0$. Such vectors are called support vectors. Here,
there are two types of support vectors. By the complementarity condition (4.28), if
$\alpha_i \neq 0$, then $y_i(w \cdot x_i + b) = 1 - \xi_i$. If $\xi_i = 0$, then $y_i(w \cdot x_i + b) = 1$ and $x_i$ lies
on a marginal hyperplane, as in the separable case. Otherwise, $\xi_i \neq 0$ and $x_i$ is an
outlier. In this case, (4.29) implies $\beta_i = 0$ and (4.27) then requires $\alpha_i = C$. Thus,
support vectors $x_i$ are either outliers, in which case $\alpha_i = C$, or vectors lying on the
marginal hyperplanes. As in the separable case, note that while the weight vector
$w$ solution is unique, the support vectors are not.
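The case analysis above can be read off the dual variables alone. The sketch below labels each training point from its $\alpha_i$; the tolerance `tol` and the label strings are assumptions made for the illustration. Note that $\alpha_i = C$ is necessary for a point to be an outlier but does not by itself exclude a point lying exactly on a marginal hyperplane.

```python
def support_vector_types(alpha, C, tol=1e-8):
    """Classify training points from their dual variables alpha_i in [0, C]."""
    labels = []
    for a in alpha:
        if a <= tol:
            labels.append("not a support vector")        # alpha_i = 0
        elif a >= C - tol:
            labels.append("at bound")                     # alpha_i = C: outlier, or degenerate margin point
        else:
            labels.append("on a marginal hyperplane")     # 0 < alpha_i < C forces beta_i > 0, so xi_i = 0
    return labels

types = support_vector_types([0.0, 0.3, 1.0], C=1.0)
```

Points with $0 < \alpha_i < C$ are the ones that can safely be used to compute the offset $b$.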


4.3.3. Dual optimization problem 


To derive the dual form of the constrained optimization problem (4.23), we plug
into the Lagrangian the definition of $w$ in terms of the dual variables (4.25) and
apply the constraint (4.26). This yields

$$L = \underbrace{\frac{1}{2}\Big\|\sum_{i=1}^m \alpha_i y_i x_i\Big\|^2 - \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)}_{-\frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} - \underbrace{\sum_{i=1}^m \alpha_i y_i b}_{0} + \sum_{i=1}^m \alpha_i. \tag{4.30}$$

Remarkably, we find that the objective function is no different than in the separable
case:

$$L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j). \tag{4.31}$$

However, here, in addition to $\alpha_i \geq 0$, we must impose the constraint on the Lagrange
variables $\beta_i \geq 0$. In view of (4.27), this is equivalent to $\alpha_i \leq C$. This leads to the
following dual optimization problem for SVMs in the non-separable case, which only
differs from that of the separable case (4.13) by the constraints $\alpha_i \leq C$:

$$\max_{\alpha} \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{4.32}$$
$$\text{subject to: } 0 \leq \alpha_i \leq C \wedge \sum_{i=1}^m \alpha_i y_i = 0, \quad i \in [1, m].$$

Thus, our previous comments about the optimization problem (4.13) apply to (4.32) 
as well. In particular, the objective function is concave and infinitely differentiable 
and (4.32) is equivalent to a convex QP. The problem is equivalent to the primal
problem (4.23).

The solution $\alpha$ of the dual problem (4.32) can be used directly to determine the
hypothesis returned by SVMs, using equation (4.25):

$$h(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + b\Big). \tag{4.33}$$

Moreover, $b$ can be obtained from any support vector $x_i$ lying on a marginal
hyperplane, that is any vector $x_i$ with $0 < \alpha_i < C$. For such support vectors,
$w \cdot x_i + b = y_i$ and thus

$$b = y_i - \sum_{j=1}^m \alpha_j y_j (x_j \cdot x_i). \tag{4.34}$$


As in the separable case, the dual optimization problem (4.32) and the expressions 
(4.33) and (4.34) show an important property of SVMs: the hypothesis solution 
depends only on inner products between vectors and not directly on the vectors 
themselves. This fact can be used to extend SVMs to define non-linear decision 
boundaries, as we shall see in chapter 5. 
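To make the dual concrete, the sketch below maximizes (4.32) by repeatedly optimizing over two dual variables at a time while keeping $\sum_i \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$. It is a simplified pairwise coordinate ascent in the spirit of two-coordinate methods such as SMO, without their selection heuristics or caching, and is meant for tiny samples only; the function names, the sweep limit and the toy data are assumptions. The data are accessed only through inner products $x_i \cdot x_j$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_dual_pairwise(X, y, C, sweeps=200, tol=1e-10):
    """Maximize the dual (4.32) by exact two-variable updates over all pairs (i, j)."""
    m = len(X)
    K = [[dot(X[i], X[j]) for j in range(m)] for i in range(m)]
    alpha = [0.0] * m
    for _ in range(sweeps):
        largest_step = 0.0
        for i in range(m):
            for j in range(i + 1, m):
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]   # curvature along the pair
                if eta <= 0.0:
                    continue  # identical points: nothing to gain along this pair
                # g_k = sum_l alpha_l y_l (x_l . x_k); the offset b cancels in the difference
                g_i = sum(alpha[l] * y[l] * K[l][i] for l in range(m))
                g_j = sum(alpha[l] * y[l] * K[l][j] for l in range(m))
                diff = (g_i - y[i]) - (g_j - y[j])
                # Interval for alpha_j that keeps both variables in [0, C]
                if y[i] != y[j]:
                    lo, hi = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    lo, hi = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                new_j = min(hi, max(lo, alpha[j] + y[j] * diff / eta))
                step = new_j - alpha[j]
                alpha[i] -= y[i] * y[j] * step   # preserves sum_k alpha_k y_k
                alpha[j] = new_j
                largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return alpha

def weights_and_offset(alpha, X, y, C, tol=1e-8):
    """w from (4.25) and b from (4.34), using a support vector with 0 < alpha_i < C."""
    w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(len(X[0]))]
    s = next(i for i, a in enumerate(alpha) if tol < a < C - tol)
    return w, y[s] - dot(w, X[s])

# Hypothetical separable sample; the closest opposite-label points are (2, 2) and (0, 0).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
alpha = solve_dual_pairwise(X, y, C=10.0)
w, b = weights_and_offset(alpha, X, y, C=10.0)
```

On this sample the procedure returns $\alpha = (1/4, 0, 1/4, 0)$, hence $w = (0.5, 0.5)$ and $b = -1$: only the two closest points are support vectors. `weights_and_offset` assumes at least one $\alpha_i$ lies strictly between 0 and $C$.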


4.4 Margin theory 


This section presents generalization bounds based on the notion of margin, which 
provide a strong theoretical justification for the SVM algorithm. We first give the 
definitions of some basic margin concepts. 


Definition 4.2 Margin 
The geometric margin $\rho(x)$ of a point $x$ with label $y$ with respect to a linear classifier
$h\colon x \mapsto w \cdot x + b$ is its distance to the hyperplane $w \cdot x + b = 0$:

$$\rho(x) = \frac{y(w \cdot x + b)}{\|w\|}. \tag{4.35}$$

The margin of a linear classifier $h$ for a sample $S = (x_1, \ldots, x_m)$ is the minimum
margin over the points in the sample:

$$\rho = \min_{1 \leq i \leq m} \frac{y_i(w \cdot x_i + b)}{\|w\|}. \tag{4.36}$$


Recall that the VC-dimension of the family of hyperplanes or linear hypotheses in
$\mathbb{R}^N$ is $N + 1$. Thus, the application of the VC-dimension bound (3.31) of corollary 3.4
to this hypothesis set yields the following: for any $\delta > 0$, with probability at least
$1 - \delta$, for any $h \in H$,

$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2(N + 1)\log\frac{em}{N + 1}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{4.37}$$

When the dimension of the feature space $N$ is large compared to the sample size,
this bound is uninformative. The following theorem presents instead a bound on the
VC-dimension of canonical hyperplanes that does not depend on the dimension of
feature space $N$, but only on the margin and the radius $r$ of the sphere containing
the data.


Theorem 4.2 
Let $S \subseteq \{x \colon \|x\| \leq r\}$. Then, the VC-dimension $d$ of the set of canonical hyperplanes
$\{x \mapsto \operatorname{sgn}(w \cdot x) \colon \min_{x \in S} |w \cdot x| = 1 \wedge \|w\| \leq \Lambda\}$ verifies

$$d \leq r^2 \Lambda^2.$$

Proof Assume $\{x_1, \ldots, x_d\}$ is a set that can be fully shattered. Then, for all
$y = (y_1, \ldots, y_d) \in \{-1, +1\}^d$, there exists $w$ such that,

$$\forall i \in [1, d], \quad 1 \leq y_i(w \cdot x_i).$$

Summing up these inequalities yields

$$d \leq w \cdot \sum_{i=1}^d y_i x_i \leq \|w\| \Big\|\sum_{i=1}^d y_i x_i\Big\| \leq \Lambda \Big\|\sum_{i=1}^d y_i x_i\Big\|.$$

Since this inequality holds for all $y \in \{-1, +1\}^d$, it also holds on expectation over
$y_1, \ldots, y_d$ drawn i.i.d. according to a uniform distribution over $\{-1, +1\}$. In view of
the independence assumption, for $i \neq j$ we have $\mathbb{E}[y_i y_j] = \mathbb{E}[y_i]\,\mathbb{E}[y_j]$. Thus, since
the distribution is uniform, $\mathbb{E}[y_i y_j] = 0$ if $i \neq j$, $\mathbb{E}[y_i y_j] = 1$ otherwise. This gives

$$\begin{aligned}
d &\leq \Lambda\, \mathbb{E}_y\Big[\Big\|\sum_{i=1}^d y_i x_i\Big\|\Big] && \text{(taking expectations)} \\
&\leq \Lambda \Big[\mathbb{E}_y\Big[\Big\|\sum_{i=1}^d y_i x_i\Big\|^2\Big]\Big]^{1/2} && \text{(Jensen's inequality)} \\
&= \Lambda \Big[\sum_{i,j=1}^d \mathbb{E}_y[y_i y_j]\,(x_i \cdot x_j)\Big]^{1/2} \\
&= \Lambda \Big[\sum_{i=1}^d (x_i \cdot x_i)\Big]^{1/2} \leq \Lambda \big[d r^2\big]^{1/2} = \Lambda r \sqrt{d}.
\end{aligned}$$

Thus, $\sqrt{d} \leq \Lambda r$, which completes the proof.

When the training data is linearly separable, by the results of section 4.2, the
maximum-margin canonical hyperplane with $\|w\| = 1/\rho$ can be plugged into
theorem 4.2. In this case, $\Lambda$ can be set to $1/\rho$, and the upper bound can be rewritten
as $r^2/\rho^2$. Note that the choice of $\Lambda$ must be made before receiving the sample $S$.

It is also possible to bound the Rademacher complexity of linear hypotheses with 
bounded weight vector in a similar way, as shown by the following theorem. 


Theorem 4.3 
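Let $S \subseteq \{x \colon \|x\| \leq R\}$ be a sample of size $m$ and let $H = \{x \mapsto w \cdot x \colon \|w\| \leq \Lambda\}$.
Then, the empirical Rademacher complexity of $H$ can be bounded as follows:

$$\widehat{\mathfrak{R}}_S(H) \leq \sqrt{\frac{R^2 \Lambda^2}{m}}.$$

Proof The proof follows through a series of inequalities similar to those of
theorem 4.2:

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H) &= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\|w\| \leq \Lambda} \sum_{i=1}^m \sigma_i\, w \cdot x_i\Big] = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\|w\| \leq \Lambda} w \cdot \sum_{i=1}^m \sigma_i x_i\Big] \\
&\leq \frac{\Lambda}{m}\,\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|\Big] \leq \frac{\Lambda}{m}\Big[\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|^2\Big]\Big]^{1/2} \\
&= \frac{\Lambda}{m}\Big[\mathbb{E}_\sigma\Big[\sum_{i,j=1}^m \sigma_i \sigma_j (x_i \cdot x_j)\Big]\Big]^{1/2} \leq \frac{\Lambda}{m}\Big[\sum_{i=1}^m \|x_i\|^2\Big]^{1/2} \\
&\leq \frac{\Lambda \sqrt{m R^2}}{m} = \sqrt{\frac{R^2 \Lambda^2}{m}}.
\end{aligned}$$

The first inequality makes use of the Cauchy-Schwarz inequality and the bound on
$\|w\|$, the second follows by Jensen's inequality, the third by $\mathbb{E}[\sigma_i \sigma_j] = \mathbb{E}[\sigma_i]\,\mathbb{E}[\sigma_j] = 0$
for $i \neq j$, and the last one by $\|x_i\| \leq R$.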
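For very small samples the empirical Rademacher complexity of this class can be computed exactly, since for fixed signs the supremum over $\|w\| \leq \Lambda$ equals $\Lambda \|\sum_i \sigma_i x_i\|$. The sketch below enumerates all $2^m$ sign vectors and compares the result with the bound of theorem 4.3; the sample points and function names are assumptions made for illustration.

```python
import itertools
import math

def rademacher_linear_exact(X, lam):
    """Exact empirical Rademacher complexity of {x -> w.x : ||w|| <= lam} on X.

    Averages lam * ||sum_i sigma_i x_i|| over all 2^m sign vectors, then divides by m.
    """
    m, dim = len(X), len(X[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        v = [sum(s * x[k] for s, x in zip(sigma, X)) for k in range(dim)]
        total += math.sqrt(sum(c * c for c in v))
    return lam * total / (2 ** m * m)

def rademacher_bound(X, lam):
    """Bound of theorem 4.3, with the radius taken as the largest norm in the sample."""
    radius = max(math.sqrt(sum(c * c for c in x)) for x in X)
    return math.sqrt(radius ** 2 * lam ** 2 / len(X))

# Hypothetical sample of six points in the plane.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 2.0), (0.5, -0.5), (2.0, 0.0)]
exact = rademacher_linear_exact(X, lam=1.0)
bound = rademacher_bound(X, lam=1.0)

# Two orthonormal points: every inequality in the proof is tight.
tight = rademacher_linear_exact([(1.0, 0.0), (0.0, 1.0)], lam=1.0)
```

The enumeration costs $2^m$ evaluations, so this is a check for toy samples rather than a practical estimator. For two orthonormal points the exact value coincides with the bound $\sqrt{1/2}$.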


To present the main margin-based generalization bounds of this section, we need 
to introduce a margin loss function. Here, the training data is not assumed to be 
separable. The quantity $\rho > 0$ should thus be interpreted as the margin we wish to
achieve.


Definition 4.3 Margin loss function
For any $\rho > 0$, the $\rho$-margin loss is the function $L_\rho\colon \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ defined for all
$y, y' \in \mathbb{R}$ by $L_\rho(y, y') = \Phi_\rho(y y')$ with,

$$\Phi_\rho(x) = \begin{cases} 0 & \text{if } \rho \leq x \\ 1 - x/\rho & \text{if } 0 \leq x \leq \rho \\ 1 & \text{if } x \leq 0. \end{cases}$$

This loss function is illustrated in figure 4.5. The empirical margin loss is then
defined as the margin loss over the training sample.

Figure 4.5 The margin loss, defined with respect to margin parameter $\rho$.


Definition 4.4 Empirical margin loss
Given a sample $S = (x_1, \ldots, x_m)$ and a hypothesis $h$, the empirical margin loss is
defined by

$$\widehat{R}_\rho(h) = \frac{1}{m}\sum_{i=1}^m \Phi_\rho(y_i h(x_i)). \tag{4.38}$$

Note that for any $i \in [1, m]$, $\Phi_\rho(y_i h(x_i)) \leq 1_{y_i h(x_i) \leq \rho}$. Thus, the empirical margin
loss can be upper-bounded as follows:

$$\widehat{R}_\rho(h) \leq \frac{1}{m}\sum_{i=1}^m 1_{y_i h(x_i) \leq \rho}. \tag{4.39}$$

In all the results that follow, the empirical margin loss can be replaced by this upper
bound, which admits a simple interpretation: it is the fraction of the points in the
training sample $S$ that have been misclassified or classified with confidence less than
$\rho$. When $h$ is a linear function defined by a weight vector $w$ with $\|w\| = 1$, $y_i h(x_i)$
is the margin of point $x_i$. Thus, the upper bound is then the fraction of the points
in the training data with margin less than $\rho$. This corresponds to the loss function
indicated by the blue dotted line in figure 4.5.
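Definitions 4.3 and 4.4 and the bound (4.39) are easy to evaluate on a list of scores. The following sketch uses made-up values of $h(x_i)$ and illustrative function names.

```python
def margin_loss(rho, u):
    """Phi_rho(u): 1 for u <= 0, linear on [0, rho], 0 for u >= rho."""
    if u <= 0:
        return 1.0
    if u >= rho:
        return 0.0
    return 1.0 - u / rho

def empirical_margin_loss(rho, scores, labels):
    """Average of Phi_rho(y_i h(x_i)) over the sample, as in (4.38)."""
    return sum(margin_loss(rho, y * s) for s, y in zip(scores, labels)) / len(scores)

def margin_violation_rate(rho, scores, labels):
    """Upper bound (4.39): fraction of points with y_i h(x_i) <= rho."""
    return sum(1 for s, y in zip(scores, labels) if y * s <= rho) / len(scores)

scores = [2.0, 0.5, -1.0, -0.25]   # hypothetical values of h(x_i)
labels = [1, 1, -1, 1]
r_hat = empirical_margin_loss(1.0, scores, labels)
r_upper = margin_violation_rate(1.0, scores, labels)
```

With $\rho = 1$ the products $y_i h(x_i)$ are $(2, 0.5, 1, -0.25)$, giving an empirical margin loss of $0.375$ and an upper bound of $0.75$.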

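The slope of the function $\Phi_\rho$ defining the margin loss is at most $1/\rho$, thus $\Phi_\rho$ is
$1/\rho$-Lipschitz. The following lemma bounds the empirical Rademacher complexity
of a hypothesis set $H$ after composition with such a Lipschitz function in terms of
the empirical Rademacher complexity of $H$. It will be needed for the proof of the
margin-based generalization bound.


Lemma 4.2 Talagrand's lemma
Let $\Phi\colon \mathbb{R} \to \mathbb{R}$ be an $l$-Lipschitz function. Then, for any hypothesis set $H$ of real-valued
functions, the following inequality holds:

$$\widehat{\mathfrak{R}}_S(\Phi \circ H) \leq l\, \widehat{\mathfrak{R}}_S(H).$$

Proof First we fix a sample $S = (x_1, \ldots, x_m)$, then, by definition,

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\Phi \circ H) &= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i (\Phi \circ h)(x_i)\Big] \\
&= \frac{1}{m}\,\mathbb{E}_{\sigma_1, \ldots, \sigma_{m-1}}\Big[\mathbb{E}_{\sigma_m}\Big[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\Big]\Big],
\end{aligned}$$

where $u_{m-1}(h) = \sum_{i=1}^{m-1} \sigma_i (\Phi \circ h)(x_i)$. By definition of the supremum, for any $\epsilon > 0$,
there exist $h_1, h_2 \in H$ such that

$$u_{m-1}(h_1) + (\Phi \circ h_1)(x_m) \geq (1 - \epsilon)\Big[\sup_{h \in H} u_{m-1}(h) + (\Phi \circ h)(x_m)\Big]$$

$$\text{and} \quad u_{m-1}(h_2) - (\Phi \circ h_2)(x_m) \geq (1 - \epsilon)\Big[\sup_{h \in H} u_{m-1}(h) - (\Phi \circ h)(x_m)\Big].$$

Thus, for any $\epsilon > 0$, by definition of $\mathbb{E}_{\sigma_m}$,

$$(1 - \epsilon)\,\mathbb{E}_{\sigma_m}\Big[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\Big] \leq \frac{1}{2}\big[u_{m-1}(h_1) + (\Phi \circ h_1)(x_m)\big] + \frac{1}{2}\big[u_{m-1}(h_2) - (\Phi \circ h_2)(x_m)\big].$$

Let $s = \operatorname{sgn}(h_1(x_m) - h_2(x_m))$. Then, the previous inequality implies

$$\begin{aligned}
(1 - \epsilon)\,\mathbb{E}_{\sigma_m}\Big[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\Big]
&\leq \frac{1}{2}\big[u_{m-1}(h_1) + u_{m-1}(h_2) + s\,l\,(h_1(x_m) - h_2(x_m))\big] && \text{(Lipschitz property)} \\
&= \frac{1}{2}\big[u_{m-1}(h_1) + s\,l\,h_1(x_m)\big] + \frac{1}{2}\big[u_{m-1}(h_2) - s\,l\,h_2(x_m)\big] && \text{(rearranging)} \\
&\leq \frac{1}{2}\sup_{h \in H}\big[u_{m-1}(h) + s\,l\,h(x_m)\big] + \frac{1}{2}\sup_{h \in H}\big[u_{m-1}(h) - s\,l\,h(x_m)\big] && \text{(definition of sup)} \\
&= \mathbb{E}_{\sigma_m}\Big[\sup_{h \in H} u_{m-1}(h) + \sigma_m\, l\, h(x_m)\Big]. && \text{(definition of } \mathbb{E}_{\sigma_m}\text{)}
\end{aligned}$$

Since the inequality holds for all $\epsilon > 0$, we have

$$\mathbb{E}_{\sigma_m}\Big[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\Big] \leq \mathbb{E}_{\sigma_m}\Big[\sup_{h \in H} u_{m-1}(h) + \sigma_m\, l\, h(x_m)\Big].$$

Proceeding in the same way for all other $\sigma_i$s ($i \neq m$) proves the lemma.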


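The following is a general margin-based generalization bound that will be used
in the analysis of several algorithms.

Theorem 4.4 Margin bound for binary classification
Let $H$ be a set of real-valued functions. Fix $\rho > 0$, then, for any $\delta > 0$, with
probability at least $1 - \delta$, each of the following holds for all $h \in H$:

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \tag{4.40}$$
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{4.41}$$

Proof Let $\widetilde{H} = \{z = (x, y) \mapsto y h(x) \colon h \in H\}$. Consider the family of functions
taking values in $[0, 1]$:

$$\widetilde{\mathcal{H}} = \{\Phi_\rho \circ f \colon f \in \widetilde{H}\}.$$

By theorem 3.1, with probability at least $1 - \delta$, for all $g \in \widetilde{\mathcal{H}}$,

$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(\widetilde{\mathcal{H}}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$

and thus, for all $h \in H$,

$$\mathbb{E}[\Phi_\rho(y h(x))] \leq \widehat{R}_\rho(h) + 2\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Since $1_{u \leq 0} \leq \Phi_\rho(u)$ for all $u \in \mathbb{R}$, we have $R(h) = \mathbb{E}[1_{y h(x) \leq 0}] \leq \mathbb{E}[\Phi_\rho(y h(x))]$, thus

$$R(h) \leq \widehat{R}_\rho(h) + 2\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

$\mathfrak{R}_m$ is invariant to a constant shift, therefore we have

$$\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) = \mathfrak{R}_m((\Phi_\rho - 1) \circ \widetilde{H}).$$

Since $(\Phi_\rho - 1)(0) = 0$ and since $(\Phi_\rho - 1)$ is $1/\rho$-Lipschitz as with $\Phi_\rho$, by lemma 4.2,
we have $\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \leq \frac{1}{\rho}\mathfrak{R}_m(\widetilde{H})$ and $\mathfrak{R}_m(\widetilde{H})$ can be rewritten as follows:

$$\mathfrak{R}_m(\widetilde{H}) = \frac{1}{m}\,\mathbb{E}_{S, \sigma}\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i y_i h(x_i)\Big] = \frac{1}{m}\,\mathbb{E}_{S, \sigma}\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i)\Big] = \mathfrak{R}_m(H).$$

This proves (4.40). The second inequality, (4.41), can be derived in the same way
by using the second inequality of theorem 3.1, (3.4), instead of (3.3).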


The generalization bounds of theorem 4.4 show the conflict between two terms:
the larger the desired margin $\rho$, the smaller the middle term; however, the first
term, the empirical margin loss $\widehat{R}_\rho$, increases as a function of $\rho$. The bounds of
this theorem can be generalized to hold uniformly for all $\rho > 0$ at the cost of an
additional term $\sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}}$, as shown in the following theorem (a version of this
theorem with better constants can be derived, see exercise 4.2).


Theorem 4.5 
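Let $H$ be a set of real-valued functions. Then, for any $\delta > 0$, with probability at least
$1 - \delta$, each of the following holds for all $h \in H$ and $\rho \in (0, 1)$:

$$R(h) \leq \widehat{R}_\rho(h) + \frac{4}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}} + \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \tag{4.42}$$
$$R(h) \leq \widehat{R}_\rho(h) + \frac{4}{\rho}\widehat{\mathfrak{R}}_S(H) + \sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2m}}. \tag{4.43}$$

Proof Consider two sequences $(\rho_k)_{k \geq 1}$ and $(\epsilon_k)_{k \geq 1}$, with $\epsilon_k \in (0, 1)$. By
theorem 4.4, for any fixed $k \geq 1$,

$$\Pr\Big[R(h) - \widehat{R}_{\rho_k}(h) > \frac{2}{\rho_k}\mathfrak{R}_m(H) + \epsilon_k\Big] \leq \exp(-2m\epsilon_k^2). \tag{4.44}$$

Choose $\epsilon_k = \epsilon + \sqrt{\frac{\log k}{m}}$, then, by the union bound,

$$\begin{aligned}
\Pr\Big[\exists k \colon R(h) - \widehat{R}_{\rho_k}(h) > \frac{2}{\rho_k}\mathfrak{R}_m(H) + \epsilon_k\Big] &\leq \sum_{k \geq 1} \exp(-2m\epsilon_k^2) \\
&= \sum_{k \geq 1} \exp\Big[-2m\big(\epsilon + \sqrt{(\log k)/m}\big)^2\Big] \\
&\leq \sum_{k \geq 1} \exp(-2m\epsilon^2)\exp(-2\log k) \\
&= \Big(\sum_{k \geq 1} 1/k^2\Big)\exp(-2m\epsilon^2) \\
&= \frac{\pi^2}{6}\exp(-2m\epsilon^2) \leq 2\exp(-2m\epsilon^2).
\end{aligned}$$

We can choose $\rho_k = 1/2^k$. For any $\rho \in (0, 1)$, there exists $k \geq 1$ such that
$\rho \in (\rho_k, \rho_{k-1}]$, with $\rho_0 = 1$. For that $k$, $\rho \leq \rho_{k-1} = 2\rho_k$, thus $1/\rho_k \leq 2/\rho$
and $\sqrt{\log k} = \sqrt{\log\log_2(1/\rho_k)} \leq \sqrt{\log\log_2(2/\rho)}$. Furthermore, for any $h \in H$,
$\widehat{R}_{\rho_k}(h) \leq \widehat{R}_\rho(h)$. Thus,

$$\Pr\Big[\exists k \colon R(h) - \widehat{R}_\rho(h) > \frac{4}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_2(2/\rho)}{m}} + \epsilon\Big] \leq 2\exp(-2m\epsilon^2),$$

which proves the first statement. The second statement can be proven in a similar
way.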


Combining theorem 4.3 and theorem 4.4 gives directly the following general 
margin bound for linear hypotheses with bounded weight vectors, presented in 
corollary 4.1. 


Corollary 4.1 
Let $H = \{x \mapsto w \cdot x \colon \|w\| \leq \Lambda\}$ and assume that $\mathcal{X} \subseteq \{x \colon \|x\| \leq r\}$. Fix $\rho > 0$,
then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$,

$$R(h) \leq \widehat{R}_\rho(h) + 2\sqrt{\frac{r^2\Lambda^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{4.45}$$

As with theorem 4.4, the bound of this corollary can be generalized to hold uniformly
for all $\rho > 0$ at the cost of an additional term $\sqrt{\frac{\log\log_2\frac{2}{\rho}}{m}}$ by combining theorems 4.3
and 4.5. This generalization bound for linear hypotheses is remarkable, since it does
not depend directly on the dimension of the feature space, but only on the margin.
It suggests that a small generalization error can be achieved when $\rho/r$ is large (small
second term) while the empirical margin loss is relatively small (first term). The
latter occurs when few points are either classified incorrectly or correctly, but with
margin less than $\rho$.
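To get a feel for the magnitudes involved, the right-hand side of (4.45) can be evaluated for given values of its parameters. The numbers below are arbitrary assumptions chosen for illustration, not values from the text.

```python
import math

def linear_margin_bound(emp_margin_loss, r, lam, rho, m, delta):
    """Right-hand side of (4.45) for linear hypotheses with ||w|| <= lam and ||x|| <= r."""
    complexity = 2.0 * math.sqrt((r ** 2) * (lam ** 2) / (rho ** 2) / m)
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return emp_margin_loss + complexity + confidence

# Hypothetical setting: 5% empirical margin loss, r = 1, Lambda = 10, rho = 1.
bound = linear_margin_bound(0.05, r=1.0, lam=10.0, rho=1.0, m=10_000, delta=0.05)
bound_more_data = linear_margin_bound(0.05, r=1.0, lam=10.0, rho=1.0, m=40_000, delta=0.05)
bound_larger_margin = linear_margin_bound(0.05, r=1.0, lam=10.0, rho=2.0, m=10_000, delta=0.05)
```

The dimension $N$ does not appear among the arguments. Quadrupling $m$ or doubling $\rho$ both halve the complexity term, provided the empirical margin loss stays the same, which it generally will not as $\rho$ grows.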

The fact that the guarantee does not explicitly depend on the dimension of the 
feature space may seem surprising and appear to contradict the VC-dimension lower 
bounds of theorems 3.6 and 3.7. Those lower bounds show that for any learning 
algorithm A there exists a bad distribution for which the error of the hypothesis 
returned by the algorithm is Q(,/d/m) with a non-zero probability. The bound of 
the corollary does not rule out such bad cases, however: for such bad distributions, 
the empirical margin loss would be large even for a relatively small margin p, and 
thus the bound of the corollary would be loose in that case. 

Thus, in some sense, the learning guarantee of the corollary hinges upon the 
hope of a good margin value $\rho$: if there exists a relatively large margin value
$\rho > 0$ for which the empirical margin loss is small, then a small generalization
error is guaranteed by the corollary. This favorable margin situation depends on the 
distribution: while the learning bound is distribution-independent, the existence of 
a good margin is in fact distribution-dependent. A favorable margin seems to appear 
relatively often in applications. 

The bound of the corollary gives a strong justification for margin-maximization 
algorithms such as SVMs. First, note that for $\rho = 1$, the margin loss can be upper
bounded by the hinge loss:

$$\forall x \in \mathbb{R},\quad \Phi_1(x) \le \max(1 - x, 0). \tag{4.46}$$

Using this fact, the bound of the corollary implies that with probability at least
$1 - \delta$, for all $h \in H = \{\mathbf{x} \mapsto \mathbf{w}\cdot\mathbf{x} \colon \|\mathbf{w}\| \le \Lambda\}$,

$$R(h) \le \frac{1}{m}\sum_{i=1}^m \xi_i + 2\sqrt{\frac{r^2\Lambda^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}, \tag{4.47}$$


where $\xi_i = \max(1 - y_i(\mathbf{w}\cdot\mathbf{x}_i), 0)$. The objective function minimized by the SVM
algorithm has precisely the form of this upper bound: the first term corresponds to
the slack penalty over the training set and the second to the minimization of
$\|\mathbf{w}\|$, which is equivalent to that of $\|\mathbf{w}\|^2$. Note that an alternative objective function
would be based on the empirical margin loss instead of the hinge loss. However, the 
advantage of the hinge loss is that it is convex, while the margin loss is not. 
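The inequality (4.46) can be checked numerically. The sketch below assumes the piecewise-linear margin loss $\Phi_\rho$ of section 4.4 (equal to 1 for $x \le 0$, to $1 - x/\rho$ on $[0, \rho]$, and to 0 for $x \ge \rho$); the helper names and the toy sample are hypothetical.

```python
def margin_loss(x, rho):
    """Margin loss Phi_rho: 1 for x <= 0, 1 - x/rho on [0, rho], 0 for x >= rho."""
    if x <= 0:
        return 1.0
    if x >= rho:
        return 0.0
    return 1.0 - x / rho

def hinge_loss(x):
    """Hinge loss max(1 - x, 0), a convex upper bound on Phi_1."""
    return max(1.0 - x, 0.0)

def empirical_losses(w, sample, rho=1.0):
    """Average margin loss and hinge loss of x -> w . x on a labeled sample.

    sample is a list of (x, y) pairs with x a list of floats and y in {-1, +1}.
    """
    margins = [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in sample]
    m = len(sample)
    return (sum(margin_loss(t, rho) for t in margins) / m,
            sum(hinge_loss(t) for t in margins) / m)
```

The two losses coincide on $[0, 1]$ and differ only for negative margins, where the hinge loss keeps growing linearly while the margin loss saturates at 1.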

As already pointed out, the bounds just discussed do not directly depend on the 
dimension of the feature space and guarantee good generalization with a favorable 
margin. Thus, they suggest seeking large-margin separating hyperplanes in a very 
high-dimensional space. In view of the form of the dual optimization problems for 
SVMs, determining the solution of the optimization and using it for prediction both 
require computing many inner products in that space. For very high-dimensional 
spaces, the computation of these inner products could become very costly. The 
next chapter provides a solution to this problem which further generalizes SVMs to 
non-linear separation. 


4.5 Chapter notes 


The maximum-margin or optimal hyperplane solution described in section 4.2 
was introduced by Vapnik and Chervonenkis [1964]. The algorithm had limited 
applications, since in most tasks in practice the data is not linearly separable. 
In contrast, the SVM algorithm of section 4.3 for the general non-separable case, 
introduced by Cortes and Vapnik [1995] under the name support-vector networks, 
has been widely adopted and been shown to be effective in practice. The algorithm 
and its theory have had a profound impact on theoretical and applied machine 
learning and inspired research on a variety of topics. Several specialized algorithms 
have been suggested for solving the specific QP that arises when solving the SVM 
problem, for example the SMO algorithm of Platt [1999] (see exercise 4.4) and a 
variety of other decomposition methods such as those used in the LibLinear software 
library [Hsieh et al., 2008], and [Allauzen et al., 2010] for solving the problem when 
using rational kernels (see chapter 5). 

Much of the theory supporting the SVM algorithm ([Cortes and Vapnik, 1995,
Vapnik, 1998]), in particular the margin theory presented in section 4.4, has been
adopted in the learning theory and statistics communities and applied to a variety
of other problems. The margin bound on the VC-dimension of canonical hyperplanes
(theorem 4.2) is by Vapnik [1998]; the proof is very similar to Novikoff's
margin bound on the number of updates made by the Perceptron algorithm in the
separable case. Our presentation of margin guarantees based on the Rademacher
complexity follows the elegant analysis of Koltchinskii and Panchenko [2002] (see
also Bartlett and Mendelson [2002], Shawe-Taylor et al. [1998]). Our proof of
Talagrand's lemma 4.2 is a simpler and more concise version of a more general result
given by Ledoux and Talagrand [1991, pp. 112-114]. See Höffgen et al. [1995] for
hardness results related to the problem of finding a hyperplane with the minimal 
number of errors on a training sample. 


4.6 Exercises 


4.1 Soft margin hyperplanes. The function of the slack variables used in the optimization
problem for soft margin hyperplanes has the form: $\xi \mapsto \sum_{i=1}^m \xi_i$. Instead,
we could use $\xi \mapsto \sum_{i=1}^m \xi_i^p$, with $p > 1$.


(a) Give the dual formulation of the problem in this general case. 


(b) How does this more general formulation (p > 1) compare to the standard 
setting (p = 1)? In the case p = 2 is the optimization still convex? 


Sparse SVM. One can give two types of arguments in favor of the SVM algorithm:
one based on the sparsity of the support vectors, another based on the notion
of margin. Suppose that instead of maximizing the margin, we choose instead to
maximize sparsity by minimizing the $L_p$ norm of the vector $\alpha$ that defines the
weight vector $\mathbf{w}$, for some $p \ge 1$. First, consider the case $p = 2$. This gives the
following optimization problem:

$$\min_{\alpha, b, \xi}\ \frac{1}{2}\sum_{i=1}^m \alpha_i^2 + C\sum_{i=1}^m \xi_i \tag{4.48}$$

$$\text{subject to}\quad y_i\Big(\sum_{j=1}^m \alpha_j y_j\, \mathbf{x}_i\cdot\mathbf{x}_j + b\Big) \ge 1 - \xi_i,\quad i \in [1, m]$$

$$\xi_i, \alpha_i \ge 0,\quad i \in [1, m].$$


(a) Show that modulo the non-negativity constraint on $\alpha$, the problem coincides
with an instance of the primal optimization problem of SVM.


(b) Derive the dual optimization problem of (4.48).


(c) Setting $p = 1$ will induce a more sparse $\alpha$. Derive the dual optimization in
this case.


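4.2 Tighter Rademacher Bound. Derive the following tighter version of the bound
of theorem 4.5: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$ and
$\rho \in (0, 1)$ the following holds:

$$R(h) \le \widehat{R}_\rho(h) + \frac{2\gamma}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\log_\gamma\frac{\gamma}{\rho}}{m}} + \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \tag{4.49}$$

for any $\gamma > 1$.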


4.3 Importance weighted SVM. Suppose you wish to use SVMs to solve a learning 
problem where some training data points are more important than others. More 
formally, assume that each training point consists of a triplet $(x_i, y_i, p_i)$, where
$0 \le p_i \le 1$ is the importance of the $i$th point. Rewrite the primal SVM constrained
optimization problem so that the penalty for mis-labeling a point $x_i$ is scaled by the
priority $p_i$. Then carry this modification through the derivation of the dual solution.


4.4 Sequential minimal optimization (SMO). The SMO algorithm is an optimization
algorithm introduced to speed up the training of SVMs. SMO reduces a (potentially)
large quadratic programming (QP) optimization problem into a series of
small optimizations involving only two Lagrange multipliers. SMO reduces memory
requirements, bypasses the need for numerical QP optimization and is easy to implement.
In this question, we will derive the update rule for the SMO algorithm in
the context of the dual formulation of the SVM problem.


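(a) Assume that we want to optimize equation 4.32 only over $\alpha_1$ and $\alpha_2$. Show
that the optimization problem reduces to

$$\max_{\alpha_1, \alpha_2}\ \underbrace{\alpha_1 + \alpha_2 - \frac{1}{2}K_{11}\alpha_1^2 - \frac{1}{2}K_{22}\alpha_2^2 - sK_{12}\alpha_1\alpha_2 - y_1\alpha_1 v_1 - y_2\alpha_2 v_2}_{\Psi_1(\alpha_1, \alpha_2)}$$

$$\text{subject to: } 0 \le \alpha_1, \alpha_2 \le C \ \wedge\ \alpha_1 + s\alpha_2 = \gamma,$$

where $\gamma = -y_1\sum_{i=3}^m y_i\alpha_i$, $s = y_1 y_2 \in \{-1, +1\}$, $K_{ij} = (\mathbf{x}_i\cdot\mathbf{x}_j)$ and
$v_i = \sum_{j=3}^m \alpha_j y_j K_{ij}$ for $i = 1, 2$.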

(b) Substitute the linear constraint $\alpha_1 = \gamma - s\alpha_2$ into $\Psi_1$ to obtain a new
objective function $\Psi_2$ that depends only on $\alpha_2$. Show that the $\alpha_2$ that maximizes
$\Psi_2$ (without the constraints $0 \le \alpha_1, \alpha_2 \le C$) can be expressed as

$$\alpha_2 = \frac{s(K_{11} - K_{12})\gamma + y_2(v_1 - v_2) - s + 1}{\eta},$$

where $\eta = K_{11} + K_{22} - 2K_{12}$.
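(c) Show that

$$v_1 - v_2 = f(\mathbf{x}_1) - f(\mathbf{x}_2) + \alpha_2^* y_2 \eta - s y_2 \gamma (K_{11} - K_{12}),$$

where $f(\mathbf{x}) = \sum_{i=1}^m \alpha_i^* y_i (\mathbf{x}_i\cdot\mathbf{x}) + b^*$ and $\alpha_i^*$ are values for the Lagrange
multipliers prior to optimization over $\alpha_1$ and $\alpha_2$ (similarly, $b^*$ is the previous
value for the offset).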


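(d) Show that

$$\alpha_2 = \alpha_2^* + y_2\,\frac{(y_2 - f(\mathbf{x}_2)) - (y_1 - f(\mathbf{x}_1))}{\eta}.$$

(e) For $s = +1$, define $L = \max\{0, \gamma - C\}$ and $H = \min\{C, \gamma\}$ as the lower
and upper bounds on $\alpha_2$. Similarly, for $s = -1$, define $L = \max\{0, -\gamma\}$ and
$H = \min\{C, C - \gamma\}$. The update rule for SMO involves "clipping" the value of
$\alpha_2$, i.e.,

$$\alpha_2^{clip} = \begin{cases} \alpha_2 & \text{if } L < \alpha_2 < H \\ L & \text{if } \alpha_2 \le L \\ H & \text{if } \alpha_2 \ge H. \end{cases}$$

We subsequently solve for $\alpha_1$ such that we satisfy the equality constraint,
resulting in $\alpha_1 = \alpha_1^* + s(\alpha_2^* - \alpha_2^{clip})$. Why is "clipping" required? How are $L$
and $H$ derived for the case $s = +1$?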
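As an illustration of how the quantities in parts (d) and (e) fit together, here is a minimal sketch of a single pair update, assuming a precomputed kernel matrix given as a list of lists. The function name `smo_pair_update` and the handling of the degenerate case $\eta \le 0$ are assumptions of this sketch; it is not Platt's full algorithm, which also selects the pair and updates the offset.

```python
def smo_pair_update(alpha, y, K, b, C, i, j):
    """One SMO step over (alpha[i], alpha[j]); i plays the role of index 1, j of index 2."""
    m = len(alpha)

    def f(t):
        # Current decision value at training point t.
        return sum(alpha[k] * y[k] * K[k][t] for k in range(m)) + b

    s = y[i] * y[j]
    gamma = alpha[i] + s * alpha[j]        # fixed by the equality constraint
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if eta <= 0:
        return list(alpha)                 # degenerate pair: left unchanged here

    # Unconstrained optimum, as in part (d).
    a_j = alpha[j] + y[j] * ((y[j] - f(j)) - (y[i] - f(i))) / eta

    # Box bounds on alpha[j], as in part (e).
    if s == 1:
        lo, hi = max(0.0, gamma - C), min(C, gamma)
    else:
        lo, hi = max(0.0, -gamma), min(C, C - gamma)
    a_j_clip = min(max(a_j, lo), hi)

    # Recover alpha[i] so that alpha[i] + s * alpha[j] stays equal to gamma.
    a_i = alpha[i] + s * (alpha[j] - a_j_clip)

    new_alpha = list(alpha)
    new_alpha[i], new_alpha[j] = a_i, a_j_clip
    return new_alpha
```

By construction, the update leaves $\sum_i \alpha_i y_i$ unchanged and keeps both multipliers in $[0, C]$, which can be verified on a small example.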


4.5 SVMs hands-on. 


(a) Download and install the libsvm software library from:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/. 
(b) Download the satimage data set found at: 
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. 


Merge the training and validation sets into one. We will refer to the resulting set 
as the training set from now on. Normalize both the training and test vectors. 


(c) Consider the binary classification that consists of distinguishing class 6 
from the rest of the data points. Use SVMs combined with polynomial kernels 
(see chapter 5) to solve this classification problem. To do so, randomly split the 
training data into ten equal-sized disjoint sets. For each value of the polynomial 
degree, d = 1,2,3,4, plot the average cross-validation error plus or minus one 
standard deviation as a function of $C$ (let the other parameters of polynomial
kernels in libsvm, $\gamma$ and $c$, be equal to their default values 1). Report the best
value of the trade-off constant $C$ measured on the validation set.


(d) Let (C*,d*) be the best pair found previously. Fix C to be C*. Plot the 
ten-fold cross-validation training and test errors for the hypotheses obtained 
as a function of d. Plot the average number of support vectors obtained as a 
function of d. 


(e) How many of the support vectors lie on the margin hyperplanes? 


(f) In the standard two-group classification, errors on positive or negative 
points are treated in the same manner. Suppose, however, that we wish to 
penalize an error on a negative point (false positive error) $k > 0$ times more than
an error on a positive point. Give the dual optimization problem corresponding 
to SVMs modified in this way. 


(g) Assume that k is an integer. Show how you can use libsvm without writing 
any additional code to find the solution of the modified SVMs just described. 
(h) Apply the modified SVMs to the classification task previously examined 
and compare with your previous SVMs results for k = 2,4, 8,16. 


5 Kernel Methods 


Kernel methods are widely used in machine learning. They are flexible techniques 
that can be used to extend algorithms such as SVMs to define non-linear decision 
boundaries. Other algorithms that only depend on inner products between sample 
points can be extended similarly, many of which will be studied in future chapters. 

The main idea behind these methods is based on so-called kernels or kernel functions,
which, under some technical conditions of symmetry and positive-definiteness,
implicitly define an inner product in a high-dimensional space. Replacing the original
inner product in the input space with positive definite kernels immediately
extends algorithms such as SVMs to a linear separation in that high-dimensional
space, or, equivalently, to a non-linear separation in the input space.

In this chapter, we present the main definitions and key properties of positive 
definite symmetric kernels, including the proof of the fact that they define an inner 
product in a Hilbert space, as well as their closure properties. We then extend the 
SVM algorithm using these kernels and present several theoretical results including 
general margin-based learning guarantees for hypothesis sets based on kernels. We 
also introduce negative definite symmetric kernels and point out their relevance to 
the construction of positive definite kernels, in particular from distances or metrics. 
Finally, we illustrate the design of kernels for non-vectorial discrete structures by 
introducing a general family of kernels for sequences, rational kernels. We describe 
an efficient algorithm for the computation of these kernels and illustrate them with 
several examples. 


5.1 Introduction 


In the previous chapter, we presented an algorithm for linear classification, SVMs,
which is both effective in applications and benefits from a strong theoretical justification.
In practice, linear separation is often not possible. Figure 5.1a shows an
example where any hyperplane crosses both populations. However, one can use more
complex functions to separate the two sets as in figure 5.1b. One way to define such
a non-linear decision boundary is to use a non-linear mapping $\Phi$ from the input
space $\mathcal{X}$ to a higher-dimensional space $\mathbb{H}$, where linear separation is possible.

Figure 5.1 Non-linearly separable case. The classification task consists of discriminating
between solid squares and solid circles. (a) No hyperplane can separate the
two populations. (b) A non-linear mapping can be used instead.

The dimension of $\mathbb{H}$ can truly be very large in practice. For example, in the
case of document classification, one may wish to use as features sequences of three
consecutive words, i.e., trigrams. Thus, with a vocabulary of just 100,000 words,
the dimension of the feature space $\mathbb{H}$ reaches $10^{15}$. On the positive side, the margin
bounds presented in section 4.4 show that, remarkably, the generalization ability of
large-margin classification algorithms such as SVMs does not depend on the dimension
of the feature space, but only on the margin $\rho$ and the number of training examples
$m$. Thus, with a favorable margin $\rho$, such algorithms could succeed even in very
high-dimensional space. However, determining the hyperplane solution requires multiple
inner product computations in high-dimensional spaces, which can become very
costly.

A solution to this problem is to use kernel methods, which are based on kernels 
or kernel functions. 


Definition 5.1 Kernels
A function $K\colon \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is called a kernel over $\mathcal{X}$.


The idea is to define a kernel $K$ such that for any two points $x, x' \in \mathcal{X}$, $K(x, x')$ be
equal to an inner product of vectors $\Phi(x)$ and $\Phi(x')$:¹

$$\forall x, x' \in \mathcal{X},\quad K(x, x') = \langle\Phi(x), \Phi(x')\rangle, \tag{5.1}$$


for some mapping $\Phi\colon \mathcal{X} \to \mathbb{H}$ to a Hilbert space $\mathbb{H}$ called a feature space. Since an
inner product is a measure of the similarity of two vectors, $K$ is often interpreted
as a similarity measure between elements of the input space $\mathcal{X}$.

An important advantage of such a kernel $K$ is efficiency: $K$ is often significantly
more efficient to compute than $\Phi$ and an inner product in $\mathbb{H}$. We will see several
common examples where the computation of $K(x, x')$ can be achieved in $O(N)$
while that of $\langle\Phi(x), \Phi(x')\rangle$ typically requires $O(\dim(\mathbb{H}))$ work, with $\dim(\mathbb{H}) \gg N$.
Furthermore, in some cases, the dimension of $\mathbb{H}$ is infinite.

Perhaps an even more crucial benefit of such a kernel function $K$ is flexibility:
there is no need to explicitly define or compute a mapping $\Phi$. The kernel $K$ can
be arbitrarily chosen so long as the existence of $\Phi$ is guaranteed, i.e. $K$ satisfies
Mercer's condition (see theorem 5.1).


Theorem 5.1 Mercer's condition
Let $\mathcal{X} \subset \mathbb{R}^N$ be a compact set and let $K\colon \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ be a continuous and symmetric
function. Then, $K$ admits a uniformly convergent expansion of the form

$$K(x, x') = \sum_{n=0}^{\infty} a_n \phi_n(x)\phi_n(x'),$$

with $a_n > 0$ iff for any square integrable function $c$ ($c \in L_2(\mathcal{X})$), the following
condition holds:

$$\iint_{\mathcal{X}\times\mathcal{X}} c(x)c(x')K(x, x')\,dx\,dx' \ge 0.$$


This condition is important to guarantee the convexity of the optimization problem 
for algorithms such as SVMs and thus convergence guarantees. A condition that 
is equivalent to Mercer’s condition under the assumptions of the theorem is that 
the kernel K be positive definite symmetric (PDS). This property is in fact more 
general since in particular it does not require any assumption about $\mathcal{X}$. In the next
section, we give the definition of this property and present several commonly used 
examples of PDS kernels, then show that PDS kernels induce an inner product in 
a Hilbert space, and prove several general closure properties for PDS kernels. 


1. To differentiate that inner product from the one of the input space, we will typically
denote it by $\langle\cdot,\cdot\rangle$.


5.2 Positive definite symmetric kernels
5.2.1 Definitions


Definition 5.2 Positive definite symmetric kernels

A kernel $K\colon \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is said to be positive definite symmetric (PDS) if for
any $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$, the matrix $\mathbf{K} = [K(x_i, x_j)]_{ij} \in \mathbb{R}^{m\times m}$ is symmetric positive
semidefinite (SPSD).


$\mathbf{K}$ is SPSD if it is symmetric and one of the following two equivalent conditions
holds:


- the eigenvalues of $\mathbf{K}$ are non-negative;
- for any column vector $\mathbf{c} = (c_1, \ldots, c_m)^\top \in \mathbb{R}^{m\times 1}$,

$$\mathbf{c}^\top\mathbf{K}\mathbf{c} = \sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) \ge 0. \tag{5.2}$$


For a sample $S = (x_1, \ldots, x_m)$, $\mathbf{K} = [K(x_i, x_j)]_{ij} \in \mathbb{R}^{m\times m}$ is called the kernel
matrix or the Gram matrix associated to $K$ and the sample $S$.
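The second characterization of SPSD matrices suggests a simple numerical sanity check: build the Gram matrix for a sample and evaluate the quadratic form on random vectors. The sketch below is only a necessary-condition test (a passing result is not a proof that a kernel is PDS); the helper names and the two example kernels are assumptions made for illustration.

```python
import math
import random

def gram_matrix(kernel, points):
    """Kernel (Gram) matrix [K(x_i, x_j)]_ij for a list of points."""
    return [[kernel(x, z) for z in points] for x in points]

def quadratic_form(K, c):
    """Value of c^T K c."""
    m = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(m) for j in range(m))

def looks_spsd(K, trials=1000, tol=1e-9, seed=0):
    """Check symmetry and c^T K c >= 0 on random vectors c (not a proof)."""
    m = len(K)
    if any(abs(K[i][j] - K[j][i]) > tol for i in range(m) for j in range(m)):
        return False
    rng = random.Random(seed)
    return all(
        quadratic_form(K, [rng.gauss(0, 1) for _ in range(m)]) >= -tol
        for _ in range(trials)
    )

def gaussian(x, z):
    """Gaussian kernel with unit width, a PDS kernel."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / 2.0)

def neg_distance(x, z):
    """Negated Euclidean distance: symmetric, but not PDS."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
```

The negated distance has a Gram matrix with zero diagonal and non-zero off-diagonal entries, hence a zero trace and at least one negative eigenvalue, so random vectors quickly expose a negative quadratic form.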

Let us insist on the terminology: the kernel matrix associated to a positive definite
kernel is positive semidefinite. This is the correct mathematical terminology.
Nevertheless, the reader should be aware that in the context of machine learning, 
some authors have chosen to use instead the term positive definite kernel to imply 
a positive definite kernel matrix or used new terms such as positive semidefinite 
kernel. 

The following are some standard examples of PDS kernels commonly used in 
applications. 


Example 5.1 Polynomial kernels
For any constant $c > 0$, a polynomial kernel of degree $d \in \mathbb{N}$ is the kernel $K$ defined
over $\mathbb{R}^N$ by:

$$\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^N,\quad K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}\cdot\mathbf{x}' + c)^d. \tag{5.3}$$

Polynomial kernels map the input space to a higher-dimensional space of dimension
$\binom{N+d}{d}$ (see exercise 5.9). As an example, for an input space of dimension $N = 2$,
a second-degree polynomial ($d = 2$) corresponds to the following inner product in
dimension 6:

$$\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^2,\quad K(\mathbf{x}, \mathbf{x}') = (x_1 x_1' + x_2 x_2' + c)^2 =
\begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \\ \sqrt{2c}\,x_1 \\ \sqrt{2c}\,x_2 \\ c \end{bmatrix}
\cdot
\begin{bmatrix} x_1'^2 \\ x_2'^2 \\ \sqrt{2}\,x_1' x_2' \\ \sqrt{2c}\,x_1' \\ \sqrt{2c}\,x_2' \\ c \end{bmatrix}. \tag{5.4}$$

Figure 5.2 Illustration of the XOR classification problem and the use of polynomial
kernels. (a) XOR problem linearly non-separable in the input space. (b)
Linearly separable using second-degree polynomial kernel. Panel (b) plots the mapped
points against the coordinates $\sqrt{2}\,x_1 x_2$ and $\sqrt{2c}\,x_1$.


Thus, the features corresponding to a second-degree polynomial are the original 
features ($x_1$ and $x_2$), as well as products of these features, and the constant feature.
More generally, the features associated to a polynomial kernel of degree d are all 
the monomials of degree at most d based on the original features. The explicit 
expression of polynomial kernels as inner products, as in (5.4), proves directly that 
they are PDS kernels. 
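The identity behind (5.4) is easy to verify numerically. The following sketch compares the second-degree polynomial kernel on $\mathbb{R}^2$ with the inner product of the explicit six-dimensional feature vectors; the function names and the test points are illustrative assumptions.

```python
import math

def poly2_kernel(x, z, c):
    """Second-degree polynomial kernel (x . z + c)^2 on R^2."""
    return (x[0] * z[0] + x[1] * z[1] + c) ** 2

def poly2_features(x, c):
    """Explicit six-dimensional feature vector matching (5.4)."""
    x1, x2 = x
    return [x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c]

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

x, z, c = (0.5, -1.5), (2.0, 0.25), 1.0
lhs = poly2_kernel(x, z, c)                               # one dot product in R^2
rhs = dot(poly2_features(x, c), poly2_features(z, c))     # dot product in R^6
```

The kernel evaluation needs a single inner product in the input space, while the explicit route first builds both six-dimensional vectors; the gap widens quickly with $N$ and $d$.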


To illustrate the application of polynomial kernels, consider the example of figure 5.2a,
which shows a simple data set in dimension two that is not linearly separable.
This is known as the XOR problem due to its interpretation in terms of the
exclusive OR (XOR) function: the label of a point is blue iff exactly one of its coordinates
is 1. However, if we map these points to the six-dimensional space defined
by a second-degree polynomial as described in (5.4), then the problem becomes
separable by the hyperplane of equation $x_1 x_2 = 0$. Figure 5.2b illustrates that by
showing the projection of these points on the two-dimensional space defined by their
third and fourth coordinates.
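A short sketch of the XOR example, assuming the four points $(\pm 1, \pm 1)$ and $c = 1$: after the second-degree feature mapping, a weight vector that reads only the third coordinate $\sqrt{2}\,x_1 x_2$ classifies all four points correctly. The encoding of the two colors as $-1$ and $+1$ is a choice made here for illustration.

```python
import math

points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
# XOR labeling: -1 when exactly one coordinate equals 1, +1 otherwise.
labels = [-1 if (x1 == 1) != (x2 == 1) else 1 for x1, x2 in points]

def features(x, c=1.0):
    """Six-dimensional feature vector of the second-degree polynomial kernel."""
    x1, x2 = x
    return [x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c]

# Linear classifier in feature space: hyperplane through the origin whose
# normal vector selects the third coordinate, i.e. the sign of x1 * x2.
w = [0, 0, 1, 0, 0, 0]
predictions = [
    1 if sum(wi * fi for wi, fi in zip(w, features(x))) > 0 else -1
    for x in points
]
```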


Example 5.2 Gaussian kernels
For any constant $\sigma > 0$, a Gaussian kernel or radial basis function (RBF) is the
kernel $K$ defined over $\mathbb{R}^N$ by:

$$\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^N,\quad K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x}' - \mathbf{x}\|^2}{2\sigma^2}\right). \tag{5.5}$$

Gaussian kernels are among the most frequently used kernels in applications. We
will prove in section 5.2.3 that they are PDS kernels and that they can be derived
by normalization from the kernels $K'\colon (\mathbf{x}, \mathbf{x}') \mapsto \exp\left(\frac{\mathbf{x}\cdot\mathbf{x}'}{\sigma^2}\right)$. Using the power series
expansion of the exponential function, we can rewrite the expression of $K'$ as follows:

$$\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^N,\quad K'(\mathbf{x}, \mathbf{x}') = \sum_{n=0}^{+\infty} \frac{(\mathbf{x}\cdot\mathbf{x}')^n}{\sigma^{2n}\, n!},$$

which shows that the kernels $K'$, and thus Gaussian kernels, are positive linear
combinations of polynomial kernels of all degrees $n \ge 0$.
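The relation between the series for $K'$ and the Gaussian kernel can be checked numerically. The sketch below truncates the power series at a fixed number of terms (an assumption that is adequate only for moderate values of $\mathbf{x}\cdot\mathbf{x}'/\sigma^2$), normalizes the result, and compares it with (5.5); all names and values are illustrative.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel exp(-||z - x||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def k_prime_series(x, z, sigma, terms=60):
    """Partial sum of sum_n (x . z)^n / (sigma^(2n) n!), approximating exp(x . z / sigma^2)."""
    t = dot(x, z) / sigma ** 2
    return sum(t ** n / math.factorial(n) for n in range(terms))

def normalized_k_prime(x, z, sigma):
    """K'(x, z) / sqrt(K'(x, x) K'(z, z)) computed from the truncated series."""
    return k_prime_series(x, z, sigma) / math.sqrt(
        k_prime_series(x, x, sigma) * k_prime_series(z, z, sigma))

x, z, sigma = (1.0, 2.0), (0.5, -1.0), 1.5
gap = abs(normalized_k_prime(x, z, sigma) - gaussian_kernel(x, z, sigma))
```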


Example 5.3 Sigmoid kernels 
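For any real constants $a, b \ge 0$, a sigmoid kernel is the kernel $K$ defined over $\mathbb{R}^N$
by:


$$\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^N,\quad K(\mathbf{x}, \mathbf{x}') = \tanh\big(a(\mathbf{x}\cdot\mathbf{x}') + b\big). \tag{5.6}$$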


Using sigmoid kernels with SVMs leads to an algorithm that is closely related to 
learning algorithms based on simple neural networks, which are also often defined 
via a sigmoid function. When a < 0 or b < 0, the kernel is not PDS and the 
corresponding neural network does not benefit from the convergence guarantees of 
convex optimization (see exercise 5.15). 
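A one-point sample already gives a quick way to see the failure for particular negative parameters: taking $m = 1$ in definition 5.2, a PDS kernel must satisfy $K(x, x) \ge 0$ for every $x$. The sketch below evaluates the sigmoid kernel at two hypothetical parameter settings where this diagonal value is negative; it illustrates specific instances only and is not a substitute for exercise 5.15.

```python
import math

def sigmoid_kernel(x, z, a, b):
    """Sigmoid kernel tanh(a (x . z) + b)."""
    return math.tanh(a * sum(u * v for u, v in zip(x, z)) + b)

# b < 0: at x = 0 the kernel value is tanh(b) < 0.
k_neg_b = sigmoid_kernel((0.0,), (0.0,), a=1.0, b=-1.0)
# a < 0 with b = 0: at x = 1 the kernel value is tanh(a) < 0.
k_neg_a = sigmoid_kernel((1.0,), (1.0,), a=-1.0, b=0.0)
```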


5.2.2 Reproducing kernel Hilbert space 
Here, we prove the crucial property of PDS kernels, which is to induce an inner 
product in a Hilbert space. The proof will make use of the following lemma. 


Lemma 5.1 Cauchy-Schwarz inequality for PDS kernels 
Let $K$ be a PDS kernel. Then, for any $x, x' \in \mathcal{X}$,

$$K(x, x')^2 \le K(x, x)K(x', x'). \tag{5.7}$$

Proof Consider the matrix $\mathbf{K} = \begin{pmatrix} K(x, x) & K(x, x') \\ K(x', x) & K(x', x') \end{pmatrix}$. By definition, if $K$ is PDS,
then $\mathbf{K}$ is SPSD for all $x, x' \in \mathcal{X}$. In particular, the product of the eigenvalues of
$\mathbf{K}$, $\det(\mathbf{K})$, must be non-negative, thus, using $K(x', x) = K(x, x')$, we have

$$\det(\mathbf{K}) = K(x, x)K(x', x') - K(x, x')^2 \ge 0,$$

which concludes the proof. $\square$

The following is the main result of this section.


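Theorem 5.2 Reproducing kernel Hilbert space (RKHS)
Let $K\colon \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ be a PDS kernel. Then, there exists a Hilbert space $\mathbb{H}$ and a
mapping $\Phi$ from $\mathcal{X}$ to $\mathbb{H}$ such that:

$$\forall x, x' \in \mathcal{X},\quad K(x, x') = \langle\Phi(x), \Phi(x')\rangle. \tag{5.8}$$

Furthermore, $\mathbb{H}$ has the following property known as the reproducing property:

$$\forall h \in \mathbb{H},\ \forall x \in \mathcal{X},\quad h(x) = \langle h, K(x, \cdot)\rangle. \tag{5.9}$$

$\mathbb{H}$ is called a reproducing kernel Hilbert space (RKHS) associated to $K$.

Proof For any $x \in \mathcal{X}$, define $\Phi(x)\colon \mathcal{X} \to \mathbb{R}$ as follows:

$$\forall x' \in \mathcal{X},\quad \Phi(x)(x') = K(x, x').$$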


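We define $\mathbb{H}_0$ as the set of finite linear combinations of such functions $\Phi(x)$:

$$\mathbb{H}_0 = \Big\{\sum_{i \in I} a_i\Phi(x_i) \colon a_i \in \mathbb{R},\ x_i \in \mathcal{X},\ \mathrm{card}(I) < \infty\Big\}.$$

Now, we introduce an operation $\langle\cdot,\cdot\rangle$ on $\mathbb{H}_0\times\mathbb{H}_0$ defined for all $f, g \in \mathbb{H}_0$ with
$f = \sum_{i\in I} a_i\Phi(x_i)$ and $g = \sum_{j\in J} b_j\Phi(x_j')$ by

$$\langle f, g\rangle = \sum_{i\in I,\, j\in J} a_i b_j K(x_i, x_j') = \sum_{j\in J} b_j f(x_j') = \sum_{i\in I} a_i g(x_i).$$

By definition, $\langle\cdot,\cdot\rangle$ is symmetric. The last two equations show that $\langle f, g\rangle$ does not
depend on the particular representations of $f$ and $g$, and also show that $\langle\cdot,\cdot\rangle$ is
bilinear. Further, for any $f = \sum_{i\in I} a_i\Phi(x_i) \in \mathbb{H}_0$, since $K$ is PDS, we have

$$\langle f, f\rangle = \sum_{i,j\in I} a_i a_j K(x_i, x_j) \ge 0.$$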

age] 
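Thus, $\langle\cdot,\cdot\rangle$ is a positive semidefinite bilinear form. This inequality implies more
generally using the bilinearity of $\langle\cdot,\cdot\rangle$ that for any $f_1, \ldots, f_m$ and $c_1, \ldots, c_m \in \mathbb{R}$,

$$\sum_{i,j=1}^{m} c_i c_j\langle f_i, f_j\rangle = \Big\langle \sum_{i=1}^{m} c_i f_i,\ \sum_{j=1}^{m} c_j f_j\Big\rangle \ge 0.$$

Hence, $\langle\cdot,\cdot\rangle$ is a PDS kernel on $\mathbb{H}_0$. Thus, for any $f \in \mathbb{H}_0$ and any $x \in \mathcal{X}$, by
lemma 5.1, we can write

$$\langle f, \Phi(x)\rangle^2 \le \langle f, f\rangle\langle\Phi(x), \Phi(x)\rangle.$$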


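Further, we observe the reproducing property of $\langle\cdot,\cdot\rangle$: for any $f = \sum_{i\in I} a_i\Phi(x_i) \in
\mathbb{H}_0$, by definition of $\langle\cdot,\cdot\rangle$,

$$\forall x \in \mathcal{X},\quad f(x) = \sum_{i\in I} a_i K(x_i, x) = \langle f, \Phi(x)\rangle. \tag{5.10}$$

Thus, $[f(x)]^2 \le \langle f, f\rangle K(x, x)$ for all $x \in \mathcal{X}$, which shows the definiteness of $\langle\cdot,\cdot\rangle$.
This implies that $\langle\cdot,\cdot\rangle$ defines an inner product on $\mathbb{H}_0$, which thereby becomes a
pre-Hilbert space. $\mathbb{H}_0$ can be completed to form a Hilbert space $\mathbb{H}$ in which it is
dense, following a standard construction. By the Cauchy-Schwarz inequality, for
any $x \in \mathcal{X}$, $f \mapsto \langle f, \Phi(x)\rangle$ is Lipschitz, therefore continuous. Thus, since $\mathbb{H}_0$ is dense
in $\mathbb{H}$, the reproducing property (5.10) also holds over $\mathbb{H}$. $\square$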
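The construction used in the proof can be mirrored in code for finite combinations. The sketch below represents an element of the pre-Hilbert space as a list of coefficients and points, implements the bilinear form from the proof, and lets one check the reproducing property (5.10) numerically. The class name `H0Element` and the choice of a Gaussian kernel on the real line are assumptions of this illustration.

```python
import math

def K(x, z):
    """Gaussian kernel on the real line, used only as an example PDS kernel."""
    return math.exp(-(x - z) ** 2 / 2.0)

class H0Element:
    """Finite combination f = sum_i a_i Phi(x_i), with Phi(x_i) = K(x_i, .)."""

    def __init__(self, coeffs, points):
        self.coeffs = list(coeffs)
        self.points = list(points)

    def __call__(self, x):
        # Pointwise evaluation f(x) = sum_i a_i K(x_i, x).
        return sum(a * K(xi, x) for a, xi in zip(self.coeffs, self.points))

def inner(f, g):
    """<f, g> = sum_{i, j} a_i b_j K(x_i, x'_j)."""
    return sum(a * b * K(xi, xj)
               for a, xi in zip(f.coeffs, f.points)
               for b, xj in zip(g.coeffs, g.points))

def phi(x):
    """Feature map: the function K(x, .) as an element of H0."""
    return H0Element([1.0], [x])

f = H0Element([0.5, -1.0, 2.0], [0.0, 1.0, 3.0])
g = H0Element([1.0, 3.0], [0.5, -2.0])
```

With these definitions, `f(x)` and `inner(f, phi(x))` agree for any real `x`, and `inner(f, g)` can equivalently be computed as a weighted sum of the values of `f` at the points defining `g`.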


The Hilbert space $\mathbb{H}$ defined in the proof of the theorem for a PDS kernel $K$ is called
the reproducing kernel Hilbert space (RKHS) associated to $K$. Any Hilbert space $\mathbb{H}$
such that there exists $\Phi\colon \mathcal{X} \to \mathbb{H}$ with $K(x, x') = \langle\Phi(x), \Phi(x')\rangle$ for all $x, x' \in \mathcal{X}$
is called a feature space associated to $K$ and $\Phi$ is called a feature mapping. We
will denote by $\|\cdot\|_{\mathbb{H}}$ the norm induced by the inner product in feature space $\mathbb{H}$:
$\|\mathbf{w}\|_{\mathbb{H}} = \sqrt{\langle\mathbf{w}, \mathbf{w}\rangle}$ for all $\mathbf{w} \in \mathbb{H}$. Note that the feature spaces associated to $K$ are in
general not unique and may have different dimensions. In practice, when referring to
the dimension of the feature space associated to K, we either refer to the dimension 
of the feature space based on a feature mapping described explicitly, or to that of 
the RKHS associated to K. 

Theorem 5.2 implies that PDS kernels can be used to implicitly define a feature 
space or feature vectors. As already underlined in previous chapters, the role played 
by the features in the success of learning algorithms is crucial: with poor features, 
uncorrelated with the target labels, learning could become very challenging or 
even impossible; in contrast, good features could provide invaluable clues to the 
algorithm. Therefore, in the context of learning with PDS kernels and for a fixed 
input space, the problem of seeking useful features is replaced by that of finding 
useful PDS kernels. While features represented the user’s prior knowledge about the 
task in the standard learning problems, here PDS kernels will play this role. Thus, 
in practice, an appropriate choice of PDS kernel for a task will be crucial. 


5.2.3 Properties 
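This section highlights several important properties of PDS kernels. We first show
that PDS kernels can be normalized and that the resulting normalized kernels are
also PDS. We also introduce the definition of empirical kernel maps and describe
their properties and extension. We then prove several important closure properties
of PDS kernels, which can be used to construct complex PDS kernels from simpler
ones.

To any kernel $K$, we can associate a normalized kernel $K'$ defined by

$$\forall x, x' \in \mathcal{X},\quad K'(x, x') = \begin{cases} 0 & \text{if } (K(x, x) = 0) \vee (K(x', x') = 0) \\[4pt] \dfrac{K(x, x')}{\sqrt{K(x, x)K(x', x')}} & \text{otherwise.} \end{cases} \tag{5.11}$$

By definition, for a normalized kernel $K'$, $K'(x, x) = 1$ for all $x \in \mathcal{X}$ such that
$K(x, x) \ne 0$. An example of normalized kernel is the Gaussian kernel with parameter
$\sigma > 0$, which is the normalized kernel associated to $K'\colon (\mathbf{x}, \mathbf{x}') \mapsto \exp\left(\frac{\mathbf{x}\cdot\mathbf{x}'}{\sigma^2}\right)$:

$$\forall \mathbf{x}, \mathbf{x}' \in \mathbb{R}^N,\quad \frac{K'(\mathbf{x}, \mathbf{x}')}{\sqrt{K'(\mathbf{x}, \mathbf{x})K'(\mathbf{x}', \mathbf{x}')}} = \frac{e^{\mathbf{x}\cdot\mathbf{x}'/\sigma^2}}{e^{\|\mathbf{x}\|^2/(2\sigma^2)}\, e^{\|\mathbf{x}'\|^2/(2\sigma^2)}} = \exp\left(-\frac{\|\mathbf{x}' - \mathbf{x}\|^2}{2\sigma^2}\right). \tag{5.12}$$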

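Lemma 5.2 Normalized PDS kernels
Let $K$ be a PDS kernel. Then, the normalized kernel $K'$ associated to $K$ is PDS.


Proof Let $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ and let $\mathbf{c}$ be an arbitrary vector in $\mathbb{R}^m$. We will show
that the sum $\sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j)$ is non-negative. By lemma 5.1, if $K(x_i, x_i) = 0$
then $K(x_i, x_j) = 0$ and thus $K'(x_i, x_j) = 0$ for all $j \in [1, m]$. Thus, we can assume
that $K(x_i, x_i) > 0$ for all $i \in [1, m]$. Then, the sum can be rewritten as follows:

$$\sum_{i,j=1}^{m} \frac{c_i c_j K(x_i, x_j)}{\sqrt{K(x_i, x_i)K(x_j, x_j)}} = \sum_{i,j=1}^{m} \frac{c_i c_j \langle\Phi(x_i), \Phi(x_j)\rangle}{\|\Phi(x_i)\|_{\mathbb{H}}\|\Phi(x_j)\|_{\mathbb{H}}} = \bigg\|\sum_{i=1}^{m} \frac{c_i\Phi(x_i)}{\|\Phi(x_i)\|_{\mathbb{H}}}\bigg\|_{\mathbb{H}}^2 \ge 0,$$

where $\Phi$ is a feature mapping associated to $K$, which exists by theorem 5.2. $\square$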


As indicated earlier, PDS kernels can be interpreted as a similarity measure since 
they induce an inner product in some Hilbert space H. This is more evident for a 
normalized kernel $K$ since $K(x, x')$ is then exactly the cosine of the angle between
the feature vectors $\Phi(x)$ and $\Phi(x')$, provided that neither of them is zero: $\Phi(x)$ and
$\Phi(x')$ are then unit vectors since $\|\Phi(x)\|_{\mathbb{H}} = \|\Phi(x')\|_{\mathbb{H}} = \sqrt{K(x, x)} = 1$.
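The normalization of a kernel is straightforward to express as a higher-order function. The sketch below, with hypothetical names, wraps any kernel given as a Python callable and returns 0 whenever one of the two diagonal values vanishes, which avoids a division by zero. Applied to the linear kernel, it recovers the usual cosine similarity.

```python
import math

def normalize(kernel):
    """Return the normalized kernel associated to `kernel`."""
    def k_norm(x, z):
        kxx, kzz = kernel(x, x), kernel(z, z)
        if kxx == 0 or kzz == 0:
            return 0.0              # degenerate case: a zero feature vector
        return kernel(x, z) / math.sqrt(kxx * kzz)
    return k_norm

def linear(x, z):
    """Linear kernel x . z."""
    return sum(a * b for a, b in zip(x, z))

# Normalizing the linear kernel gives the cosine of the angle between x and z.
cosine = normalize(linear)
```

For instance, `cosine((1.0, 0.0), (1.0, 1.0))` evaluates the cosine of a 45 degree angle, and any non-zero point has similarity 1 with itself.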

While one of the advantages of PDS kernels is an implicit definition of a feature 
mapping, in some instances, it may be desirable to define an explicit feature 
mapping based on a PDS kernel. This may be to work in the primal for various 
optimization and computational reasons, to derive an approximation based on an 
explicit mapping, or as part of a theoretical analysis where an explicit mapping 
is more convenient. The empirical kernel map $\Phi$ associated to a PDS kernel $K$ is
a feature mapping that can be used precisely in such contexts. Given a training
sample containing points $x_1, \ldots, x_m \in \mathcal{X}$, $\Phi\colon \mathcal{X} \to \mathbb{R}^m$ is defined for all $x \in \mathcal{X}$ by

$$\Phi(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_m) \end{bmatrix}.$$

Thus, $\Phi(x)$ is the vector of the $K$-similarity measures of $x$ with each of the training
points. Let $\mathbf{K}$ be the kernel matrix associated to $K$ and $\mathbf{e}_i$ the $i$th unit vector.
Note that for any $i \in [1, m]$, $\Phi(x_i)$ is the $i$th column of $\mathbf{K}$, that is $\Phi(x_i) = \mathbf{K}\mathbf{e}_i$. In
particular, for all $i, j \in [1, m]$,

$$\langle\Phi(x_i), \Phi(x_j)\rangle = (\mathbf{K}\mathbf{e}_i)^\top(\mathbf{K}\mathbf{e}_j) = \mathbf{e}_i^\top\mathbf{K}^2\mathbf{e}_j.$$


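Thus, the kernel matrix $\mathbf{K}'$ associated to $\Phi$ is $\mathbf{K}^2$. It may be desirable in some cases
to define a feature mapping whose kernel matrix coincides with $\mathbf{K}$. Let $(\mathbf{K}^\dagger)^{1/2}$ denote
the SPSD matrix whose square is $\mathbf{K}^\dagger$, the pseudo-inverse of $\mathbf{K}$. $(\mathbf{K}^\dagger)^{1/2}$ can be derived
from $\mathbf{K}^\dagger$ via singular value decomposition and if the matrix $\mathbf{K}$ is invertible, $(\mathbf{K}^\dagger)^{1/2}$
coincides with $\mathbf{K}^{-1/2}$ (see appendix A for properties of the pseudo-inverse). Then,
$\Psi$ can be defined as follows using the empirical kernel map $\Phi$:

$$\forall x \in \mathcal{X},\quad \Psi(x) = (\mathbf{K}^\dagger)^{1/2}\Phi(x).$$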


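Using the identity $\mathbf{K}\mathbf{K}^\dagger\mathbf{K} = \mathbf{K}$ valid for any symmetric matrix $\mathbf{K}$, for all $i, j \in [1, m]$,
the following holds:

$$\langle\Psi(x_i), \Psi(x_j)\rangle = \big((\mathbf{K}^\dagger)^{1/2}\mathbf{K}\mathbf{e}_i\big)^\top\big((\mathbf{K}^\dagger)^{1/2}\mathbf{K}\mathbf{e}_j\big) = \mathbf{e}_i^\top\mathbf{K}\mathbf{K}^\dagger\mathbf{K}\mathbf{e}_j = \mathbf{e}_i^\top\mathbf{K}\mathbf{e}_j.$$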


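Thus, the kernel matrix associated to $\Psi$ is $\mathbf{K}$. Finally, note that for the feature
mapping $\Omega\colon \mathcal{X} \to \mathbb{R}^m$ defined by

$$\forall x \in \mathcal{X},\quad \Omega(x) = \mathbf{K}^\dagger\Phi(x),$$

for all $i, j \in [1, m]$, we have $\langle\Omega(x_i), \Omega(x_j)\rangle = \mathbf{e}_i^\top\mathbf{K}\mathbf{K}^\dagger\mathbf{K}^\dagger\mathbf{K}\mathbf{e}_j = \mathbf{e}_i^\top\mathbf{K}\mathbf{K}^\dagger\mathbf{e}_j$, using the
identity $\mathbf{K}^\dagger\mathbf{K}^\dagger\mathbf{K} = \mathbf{K}^\dagger$ valid for any symmetric matrix $\mathbf{K}$. Thus, the kernel matrix
associated to $\Omega$ is $\mathbf{K}\mathbf{K}^\dagger$, which reduces to the identity matrix $\mathbf{I} \in \mathbb{R}^{m\times m}$ when $\mathbf{K}$ is
invertible, since $\mathbf{K}^\dagger = \mathbf{K}^{-1}$ in that case.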
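The first of these facts, that the empirical kernel map has kernel matrix equal to the square of the original kernel matrix, can be checked with a few lines of code. The sketch below uses plain Python lists and a Gaussian kernel on the real line; the helper names and the sample are assumptions, and the mappings built from the pseudo-inverse are deliberately left out since they would need a linear algebra library.

```python
import math

def gram_matrix(kernel, points):
    """Kernel matrix [K(x_i, x_j)]_ij."""
    return [[kernel(x, z) for z in points] for x in points]

def empirical_kernel_map(kernel, sample):
    """Return Phi with Phi(x) = (K(x, x_1), ..., K(x, x_m))."""
    return lambda x: [kernel(x, xi) for xi in sample]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def kernel(x, z):
    """Gaussian kernel on the real line (example PDS kernel)."""
    return math.exp(-(x - z) ** 2 / 2.0)

sample = [0.0, 0.5, 2.0]
Phi = empirical_kernel_map(kernel, sample)
K = gram_matrix(kernel, sample)
K2 = matmul(K, K)
# Gram matrix of the mapped sample under the Euclidean inner product of R^m.
G = [[sum(a * b for a, b in zip(Phi(x), Phi(z))) for z in sample] for x in sample]
```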

As pointed out in the previous section, kernels represent the user’s prior knowl- 
edge about a task. In some cases, a user may come up with appropriate similarity 
measures or PDS kernels for some subtasks — for example, for different subcat- 
egories of proteins or text documents to classify. But how can he combine these 
PDS kernels to form a PDS kernel for the entire class? Is the resulting combined 
kernel guaranteed to be PDS? In the following, we will show that PDS kernels are
closed under several useful operations which can be used to design complex PDS
kernels. These operations are the sum and the product of kernels, as well as the
tensor product of two kernels $K$ and $K'$, denoted by $K \otimes K'$ and defined by

$$\forall x_1, x_2, x_1', x_2' \in \mathcal{X},\quad (K \otimes K')(x_1, x_1', x_2, x_2') = K(x_1, x_2)K'(x_1', x_2').$$

They also include the pointwise limit: given a sequence of kernels $(K_n)_{n\in\mathbb{N}}$ such that
for all $x, x' \in \mathcal{X}$, $(K_n(x, x'))_{n\in\mathbb{N}}$ admits a limit, the pointwise limit of $(K_n)_{n\in\mathbb{N}}$ is
the kernel $K$ defined for all $x, x' \in \mathcal{X}$ by $K(x, x') = \lim_{n\to+\infty} K_n(x, x')$. Similarly,
if $\sum_{n=0}^{\infty} a_n x^n$ is a power series with radius of convergence $\rho > 0$ and $K$ a kernel
taking values in $(-\rho, +\rho)$, then $\sum_{n=0}^{\infty} a_n K^n$ is the kernel obtained by composition
of $K$ with that power series. The following theorem provides closure guarantees for
all of these operations.


Theorem 5.3 PDS kernels — closure properties 
PDS kernels are closed under sum, product, tensor product, pointwise limit, and
composition with a power series $\sum_{n=0}^{\infty} a_n x^n$ with $a_n \ge 0$ for all $n \in \mathbb{N}$.


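Proof We start with two kernel matrices, $\mathbf{K}$ and $\mathbf{K}'$, generated from PDS kernels
$K$ and $K'$ for an arbitrary set of $m$ points. By assumption, these kernel matrices
are SPSD. Observe that for any $\mathbf{c} \in \mathbb{R}^{m\times 1}$,

$$(\mathbf{c}^\top\mathbf{K}\mathbf{c} \ge 0) \wedge (\mathbf{c}^\top\mathbf{K}'\mathbf{c} \ge 0) \Rightarrow \mathbf{c}^\top(\mathbf{K} + \mathbf{K}')\mathbf{c} \ge 0.$$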


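By (5.2), this shows that $\mathbf{K} + \mathbf{K}'$ is SPSD and thus that $K + K'$ is PDS. To show
closure under product, we will use the fact that for any SPSD matrix $\mathbf{K}$ there exists
$\mathbf{M}$ such that $\mathbf{K} = \mathbf{M}\mathbf{M}^\top$. The existence of $\mathbf{M}$ is guaranteed as it can be generated
via, for instance, singular value decomposition of $\mathbf{K}$, or by Cholesky decomposition.
The kernel matrix associated to $KK'$ is $(\mathbf{K}_{ij}\mathbf{K}'_{ij})_{ij}$. For any $\mathbf{c} \in \mathbb{R}^{m\times 1}$, expressing
$\mathbf{K}_{ij}$ in terms of the entries of $\mathbf{M}$, we can write

$$\begin{aligned}
\sum_{i,j=1}^{m} c_i c_j (\mathbf{K}_{ij}\mathbf{K}'_{ij}) &= \sum_{i,j=1}^{m} c_i c_j \Big(\Big[\sum_{k=1}^{m} \mathbf{M}_{ik}\mathbf{M}_{jk}\Big]\mathbf{K}'_{ij}\Big) \\
&= \sum_{k=1}^{m} \Big[\sum_{i,j=1}^{m} c_i c_j \mathbf{M}_{ik}\mathbf{M}_{jk}\mathbf{K}'_{ij}\Big] \\
&= \sum_{k=1}^{m} \mathbf{z}_k^\top\mathbf{K}'\mathbf{z}_k \ge 0,
\end{aligned}$$

with $\mathbf{z}_k = \begin{bmatrix} c_1\mathbf{M}_{1k} \\ \vdots \\ c_m\mathbf{M}_{mk} \end{bmatrix}$. This shows that PDS kernels are closed under product.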


The tensor product of $K$ and $K'$ is PDS as the product of the two PDS kernels 
$(x_1, x_1', x_2, x_2') \mapsto K(x_1, x_2)$ and $(x_1, x_1', x_2, x_2') \mapsto K'(x_1', x_2')$. Next, let $(K_n)_{n \in \mathbb{N}}$ 
be a sequence of PDS kernels with pointwise limit $K$. Let $\mathbf{K}$ be the kernel matrix 
associated to $K$ and $\mathbf{K}_n$ the one associated to $K_n$ for any $n \in \mathbb{N}$. Observe that 

\[ (\forall n,\ \mathbf{c}^\top \mathbf{K}_n \mathbf{c} \ge 0) \Rightarrow \lim_{n \to \infty} \mathbf{c}^\top \mathbf{K}_n \mathbf{c} = \mathbf{c}^\top \mathbf{K} \mathbf{c} \ge 0. \]

This shows the closure under pointwise limit. Finally, assume that $K$ is a PDS 
kernel with $|K(x, x')| < \rho$ for all $x, x' \in \mathcal{X}$ and let $f\colon x \mapsto \sum_{n=0}^{\infty} a_n x^n$, $a_n \ge 0$, be a 
power series with radius of convergence $\rho$. Then, for any $n \in \mathbb{N}$, $K^n$ and thus $a_n K^n$ 
are PDS by closure under product. For any $N \in \mathbb{N}$, $\sum_{n=0}^{N} a_n K^n$ is PDS by closure 
under sum of the $a_n K^n$s and $f \circ K$ is PDS by closure under the limit of $\sum_{n=0}^{N} a_n K^n$ 
as $N$ tends to infinity. □ 


The theorem implies in particular that for any PDS kernel $K$, $\exp(K)$ is 
PDS, since the radius of convergence of exp is infinite. For instance, the kernel 
$K'\colon (x, x') \mapsto \exp\big(\frac{x \cdot x'}{\sigma^2}\big)$ is PDS since $(x, x') \mapsto \frac{x \cdot x'}{\sigma^2}$ is PDS. Thus, by lemma 5.2, 
this shows that a Gaussian kernel, which is the normalized kernel associated to $K'$, 
is PDS. 
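
The closure properties of theorem 5.3 can be checked numerically on a small sample. The following is only a sketch under stated assumptions: the sample points, the two base kernels, and the helper name `is_spsd` are illustrative choices that do not come from the text, and a numerical eigenvalue test with a tolerance is a sanity check, not a proof.

```python
import numpy as np

def is_spsd(K, tol=1e-9):
    """True if the symmetric matrix K has no eigenvalue below -tol (relative to its scale)."""
    scale = max(1.0, float(np.abs(K).max()))
    return bool(np.linalg.eigvalsh(K).min() >= -tol * scale)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))           # six arbitrary sample points in R^3 (illustrative data)

K1 = X @ X.T                          # kernel matrix of the linear kernel x . x'
K2 = (1.0 + X @ X.T) ** 2             # kernel matrix of a degree-2 polynomial kernel

results = {
    "sum": is_spsd(K1 + K2),
    "product": is_spsd(K1 * K2),      # entry-wise product, matching (K_ij K'_ij)
    "exp": is_spsd(np.exp(K1)),       # entry-wise exp: power series with a_n = 1/n! >= 0
}
print(results)
```

Note that the product and the exponential are taken entry-wise on the kernel matrices, which is what composing the kernel functions themselves corresponds to.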


5.3 Kernel-based algorithms 


In this section we discuss how SVMs can be used with kernels and analyze the 
impact that kernels have on generalization. 


5.3.1 SVMs with PDS kernels 


In chapter 4, we noted that the dual optimization problem for SVMs as well as the 
form of the solution did not directly depend on the input vectors but only on inner 
products. Since a PDS kernel implicitly defines an inner product (theorem 5.2), we 
can extend SVMs and combine them with an arbitrary PDS kernel $K$ by replacing each 
instance of an inner product $x \cdot x'$ with $K(x, x')$. This leads to the following general 
form of the SVM optimization problem and solution with PDS kernels extending 
(4.32): 


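\[ \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (5.13) \]
\[ \text{subject to: } 0 \le \alpha_i \le C \ \wedge \ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad i \in [1, m]. \]

In view of (4.33), the hypothesis $h$ solution can be written as: 

\[ h(x) = \operatorname{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \Big), \qquad (5.14) \]

with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_j, x_i)$ for any $x_i$ with $0 < \alpha_i < C$. We can rewrite 
the optimization problem (5.13) in a vector form, by using the kernel matrix $\mathbf{K}$ 
associated to $K$ for the training sample $(x_1, \ldots, x_m)$ as follows: 

\[ \max_{\alpha} \ 2\, \mathbf{1}^\top \alpha - (\alpha \circ \mathbf{y})^\top \mathbf{K} (\alpha \circ \mathbf{y}) \qquad (5.15) \]
\[ \text{subject to: } \mathbf{0} \le \alpha \le \mathbf{C} \ \wedge \ \alpha^\top \mathbf{y} = 0. \]

In this formulation, $\alpha \circ \mathbf{y}$ is the Hadamard product or entry-wise product of the 
vectors $\alpha$ and $\mathbf{y}$. Thus, it is the column vector in $\mathbb{R}^{m \times 1}$ whose $i$th component 
equals $\alpha_i y_i$. The solution in vector form is the same as in (5.14), but with $b = 
y_i - (\alpha \circ \mathbf{y})^\top \mathbf{K} \mathbf{e}_i$ for any $x_i$ with $0 < \alpha_i < C$. 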

This version of SVMs used with PDS kernels is the general form of SVMs we 
will consider in all that follows. The extension is important, since it enables an 
implicit non-linear mapping of the input points to a high-dimensional space where 
large-margin separation is sought. 
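
As a small illustration of (5.14) and of the vector form of the offset $b$, the sketch below evaluates the kernel SVM hypothesis from a given dual solution. The function names `svm_offset` and `svm_predict` and the two-point toy problem are hypothetical choices made for this example; the dual variables are those of this toy problem with $C = 1$ and are supplied by hand, not computed by a solver.

```python
import numpy as np

def svm_offset(alpha, y, K, C, tol=1e-9):
    """Offset b = y_i - (alpha o y)^T K e_i for some i with 0 < alpha_i < C."""
    support = np.where((alpha > tol) & (alpha < C - tol))[0]
    i = support[0]
    return y[i] - (alpha * y) @ K[:, i]

def svm_predict(alpha, y, b, k_test):
    """Hypothesis (5.14); k_test[i, t] = K(x_i, x_t) for each test point x_t."""
    return np.sign((alpha * y) @ k_test + b)

# Toy problem (hypothetical): two points on the real line with a linear kernel.
x_train = np.array([1.0, -1.0])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])          # maximizer of (5.13) for this toy problem with C = 1
K = np.outer(x_train, x_train)        # kernel matrix of the linear kernel

b = svm_offset(alpha, y_train, K, C=1.0)
x_test = np.array([2.0, -0.3])
pred = svm_predict(alpha, y_train, b, np.outer(x_train, x_test))
print(b, pred)
```

Only kernel values enter the computation, so replacing `np.outer` by any other PDS kernel evaluation leaves the two functions unchanged.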

Many other algorithms in areas including regression, ranking, dimensionality 
reduction or clustering can be extended using PDS kernels following the same 
scheme (see in particular chapters 8, 9, 10, 12). 


5.3.2 Representer theorem 


Observe that modulo the offset $b$, the hypothesis solution of SVMs can be written 
as a linear combination of the functions $K(x_i, \cdot)$, where $x_i$ is a sample point. The 
following theorem known as the representer theorem shows that this is in fact a 
general property that holds for a broad class of optimization problems, including 
that of SVMs with no offset. 


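Theorem 5.4 Representer theorem 

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and $\mathbb{H}$ its corresponding RKHS. Then, for any 
non-decreasing function $G\colon \mathbb{R} \to \mathbb{R}$ and any loss function $L\colon \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$, the 
optimization problem 

\[ \operatorname*{argmin}_{h \in \mathbb{H}} F(h) = \operatorname*{argmin}_{h \in \mathbb{H}} \ G(\|h\|_{\mathbb{H}}) + L\big(h(x_1), \ldots, h(x_m)\big) \]

admits a solution of the form $h^* = \sum_{i=1}^{m} \alpha_i K(x_i, \cdot)$. If $G$ is further assumed to be 
increasing, then any solution has this form. 


Proof Let $\mathbb{H}_1 = \operatorname{span}(\{K(x_i, \cdot)\colon i \in [1, m]\})$. Any $h \in \mathbb{H}$ admits the decomposition 
$h = h_1 + h^\perp$ according to $\mathbb{H} = \mathbb{H}_1 \oplus \mathbb{H}_1^\perp$, where $\oplus$ is the direct sum. Since $G$ is 
non-decreasing, $G(\|h_1\|_{\mathbb{H}}) \le G\big(\sqrt{\|h_1\|_{\mathbb{H}}^2 + \|h^\perp\|_{\mathbb{H}}^2}\big) = G(\|h\|_{\mathbb{H}})$. By the reproducing 
property, for all $i \in [1, m]$, $h(x_i) = \langle h, K(x_i, \cdot) \rangle = \langle h_1, K(x_i, \cdot) \rangle = h_1(x_i)$. Thus, 
$L(h(x_1), \ldots, h(x_m)) = L(h_1(x_1), \ldots, h_1(x_m))$ and $F(h_1) \le F(h)$. This proves the 
first part of the theorem. If $G$ is further increasing, then $F(h_1) < F(h)$ when 
$\|h^\perp\|_{\mathbb{H}} > 0$ and any solution of the optimization problem must be in $\mathbb{H}_1$. □ 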
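
A concrete instance may help: for the squared loss with $G(t) = \lambda t^2$ (kernel ridge regression, a choice made here for illustration and not discussed in this section), the representer theorem reduces the search over $\mathbb{H}$ to the $m$ coefficients $\alpha_i$, and setting the gradient to zero gives $\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}$. The sketch below, with hypothetical names and synthetic data, compares that solution for a linear kernel with the one obtained by optimizing directly over weight vectors.

```python
import numpy as np

def fit_kernel_ridge(K, y, lam):
    """Coefficients alpha of h* = sum_i alpha_i K(x_i, .) minimizing
    lam * ||h||^2 + sum_i (h(x_i) - y_i)^2."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))          # synthetic sample (illustrative data)
y = rng.normal(size=10)
lam = 0.5

K = X @ X.T                           # linear kernel matrix
alpha = fit_kernel_ridge(K, y, lam)
kernel_fit = K @ alpha                # h*(x_i) = sum_j alpha_j K(x_j, x_i)

# Same problem solved directly over weight vectors w, which is possible here
# because the linear kernel has an explicit finite-dimensional feature map.
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
primal_fit = X @ w
print(np.allclose(kernel_fit, primal_fit))
```

The two fits agree, and the primal weight vector is recovered as $w = \sum_i \alpha_i x_i$, that is, as an element of the span of the sample points.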


5.3.3 Learning guarantees 


Here, we present general learning guarantees for hypothesis sets based on PDS 
kernels, which hold in particular for SVMs combined with PDS kernels. 

The following theorem gives a general bound on the empirical Rademacher 
complexity of kernel-based hypotheses with bounded norm, that is a hypothesis 
set of the form $H = \{h \in \mathbb{H}\colon \|h\|_{\mathbb{H}} \le \Lambda\}$, for some $\Lambda \ge 0$, where $\mathbb{H}$ is the 
RKHS associated to a kernel $K$. By the reproducing property, any $h \in H$ is of 
the form $x \mapsto \langle h, K(x, \cdot) \rangle = \langle h, \Phi(x) \rangle$ with $\|h\|_{\mathbb{H}} \le \Lambda$, where $\Phi$ is a feature mapping 
associated to $K$, that is of the form $x \mapsto \langle w, \Phi(x) \rangle$ with $\|w\|_{\mathbb{H}} \le \Lambda$. 


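Theorem 5.5 Rademacher complexity of kernel-based hypotheses 

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a feature mapping 
associated to $K$. Let $S \subseteq \{x\colon K(x, x) \le r^2\}$ be a sample of size $m$, and let 
$H = \{x \mapsto w \cdot \Phi(x)\colon \|w\|_{\mathbb{H}} \le \Lambda\}$ for some $\Lambda \ge 0$. Then 

\[ \widehat{\mathfrak{R}}_S(H) \le \frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{r^2 \Lambda^2}{m}}. \qquad (5.16) \]

Proof The proof steps are as follows: 

\begin{align*}
\widehat{\mathfrak{R}}_S(H)
  &= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{\|w\|_{\mathbb{H}} \le \Lambda} \Big\langle w, \sum_{i=1}^{m} \sigma_i \Phi(x_i) \Big\rangle \Big] \\
  &= \frac{\Lambda}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi(x_i) \Big\|_{\mathbb{H}} \Big] && \text{(Cauchy-Schwarz, eq. case)} \\
  &\le \frac{\Lambda}{m} \Big[ \mathop{\mathrm{E}}_{\sigma} \Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi(x_i) \Big\|_{\mathbb{H}}^2 \Big] \Big]^{1/2} && \text{(Jensen's ineq.)} \\
  &= \frac{\Lambda}{m} \Big[ \mathop{\mathrm{E}}_{\sigma} \Big[ \sum_{i=1}^{m} \|\Phi(x_i)\|_{\mathbb{H}}^2 \Big] \Big]^{1/2} && (i \ne j \Rightarrow \mathop{\mathrm{E}}_{\sigma}[\sigma_i \sigma_j] = 0) \\
  &= \frac{\Lambda}{m} \Big[ \mathop{\mathrm{E}}_{\sigma} \Big[ \sum_{i=1}^{m} K(x_i, x_i) \Big] \Big]^{1/2} \\
  &= \frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{r^2 \Lambda^2}{m}}.
\end{align*}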

The initial equality holds by definition of the empirical Rademacher complexity 
(definition 3.2). The second equality follows from the Cauchy-Schwarz inequality, 
whose equality case is attained by a vector $w$ with $\|w\|_{\mathbb{H}} = \Lambda$. The following 
inequality results from Jensen's inequality (theorem B.4) 
applied to the concave function $\sqrt{\cdot}$. The subsequent equality is a consequence of 
$\mathrm{E}_{\sigma}[\sigma_i \sigma_j] = \mathrm{E}_{\sigma}[\sigma_i] \mathrm{E}_{\sigma}[\sigma_j] = 0$ for $i \ne j$, since the Rademacher variables $\sigma_i$ and 
$\sigma_j$ are independent. The statement of the theorem then follows by noting that 
$\operatorname{Tr}[\mathbf{K}] \le m r^2$. □ 


The theorem indicates that the trace of the kernel matrix is an important quantity 
for controlling the complexity of hypothesis sets based on kernels. Observe that 
by the Khintchine-Kahane inequality (D.22), the empirical Rademacher complexity 
$\widehat{\mathfrak{R}}_S(H) = \frac{\Lambda}{m} \mathrm{E}_{\sigma}\big[\big\|\sum_{i=1}^{m} \sigma_i \Phi(x_i)\big\|_{\mathbb{H}}\big]$ can also be lower bounded by $\frac{1}{\sqrt{2}} \frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{m}$, which 
only differs from the upper bound found by the constant $\frac{1}{\sqrt{2}}$. Also, note that if 
$K(x, x) \le r^2$ for all $x \in \mathcal{X}$, then the inequalities 5.16 hold for all samples $S$. 
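
For a small sample, the expectation over Rademacher variables can be computed exactly by enumeration, which makes it possible to compare the quantity $\frac{\Lambda}{m} \mathrm{E}_{\sigma}[\sqrt{\sigma^\top \mathbf{K} \sigma}]$ with the upper and lower bounds just discussed. The sketch below is illustrative only: the sample, the Gaussian kernel, the value of $\Lambda$ and the function name are assumptions made for this example.

```python
import itertools
import numpy as np

def empirical_rademacher(K, Lambda):
    """Exact (Lambda/m) * E_sigma[ sqrt(sigma^T K sigma) ], enumerating all 2^m sign vectors.
    This uses ||sum_i sigma_i Phi(x_i)||^2 = sigma^T K sigma."""
    m = K.shape[0]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        s = np.array(signs)
        total += np.sqrt(max(s @ K @ s, 0.0))
    return Lambda * total / (2 ** m * m)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))           # small synthetic sample, m = 8
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 2.0)                 # Gaussian kernel matrix, so K(x, x) = 1 and r = 1
Lambda = 2.0
m = K.shape[0]

value = empirical_rademacher(K, Lambda)
upper = Lambda * np.sqrt(np.trace(K)) / m      # first bound in (5.16)
lower = upper / np.sqrt(2.0)                   # Khintchine-Kahane lower bound
print(lower, value, upper)
```

The enumeration costs $2^m$ evaluations, so it is only practical for very small $m$; for larger samples one would sample sign vectors instead.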

The bound of theorem 5.5 or the inequalities 5.16 can be plugged into any of the 
Rademacher complexity generalization bounds presented in the previous chapters. 
In particular, in combination with theorem 4.4, they lead directly to the following 
margin bound similar to that of corollary 4.1. 


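Corollary 5.1 Margin bounds for kernel-based hypotheses 

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel with $r^2 = \sup_{x \in \mathcal{X}} K(x, x)$. Let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a 
feature mapping associated to $K$ and let $H = \{x \mapsto w \cdot \Phi(x)\colon \|w\|_{\mathbb{H}} \le \Lambda\}$ for some 
$\Lambda \ge 0$. Fix $\rho > 0$. Then, for any $\delta > 0$, each of the following statements holds with 
probability at least $1 - \delta$ for any $h \in H$: 

\[ R(h) \le \widehat{R}_\rho(h) + 2 \sqrt{\frac{r^2 \Lambda^2 / \rho^2}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \qquad (5.17) \]

\[ R(h) \le \widehat{R}_\rho(h) + 2 \frac{\sqrt{\operatorname{Tr}[\mathbf{K}] \Lambda^2 / \rho^2}}{m} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \qquad (5.18) \]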


5.4 Negative definite symmetric kernels 


Often in practice, a natural distance or metric is available for the learning task 
considered. This metric could be used to define a similarity measure. As an example, 
Gaussian kernels have the form exp(—d?), where d is a metric for the input vector 
space. Several natural questions arise such as: what other PDS kernels can we 
construct from a metric in a Hilbert space? What technical condition should d 
satisfy to guarantee that exp(—d?) is PDS? A natural mathematical definition that 
helps address these questions is that of negative definite symmetric (NDS) kernels. 


Definition 5.3 Negative definite symmetric (NDS) kernels 
A kernel $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be negative-definite symmetric (NDS) if it 
is symmetric and if for all $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ and $\mathbf{c} \in \mathbb{R}^{m \times 1}$ with $\mathbf{1}^\top \mathbf{c} = 0$, the 
following holds: 

\[ \mathbf{c}^\top \mathbf{K} \mathbf{c} \le 0. \]

Clearly, if $K$ is PDS, then $-K$ is NDS, but the converse does not hold in general. 
The following gives a standard example of an NDS kernel. 


Example 5.4 Squared distance — NDS kernel 
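The squared distance $(x, x') \mapsto \|x' - x\|^2$ in $\mathbb{R}^N$ defines an NDS kernel. Indeed, let 
$\mathbf{c} \in \mathbb{R}^{m \times 1}$ with $\sum_{i=1}^{m} c_i = 0$. Then, for any $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$, we can write 

\begin{align*}
\sum_{i,j=1}^{m} c_i c_j \|x_i - x_j\|^2
  &= \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 - 2\, x_i \cdot x_j \big) \\
  &= \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) - 2 \sum_{i=1}^{m} c_i x_i \cdot \sum_{j=1}^{m} c_j x_j \\
  &= \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) - 2 \Big\| \sum_{i=1}^{m} c_i x_i \Big\|^2 \\
  &\le \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) \\
  &= \Big( \sum_{j=1}^{m} c_j \Big) \Big( \sum_{i=1}^{m} c_i \|x_i\|^2 \Big) + \Big( \sum_{i=1}^{m} c_i \Big) \Big( \sum_{j=1}^{m} c_j \|x_j\|^2 \Big) = 0.
\end{align*}

The next theorems show connections between NDS and PDS kernels. These 
results provide another series of tools for designing PDS kernels. 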


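Theorem 5.6 
Let $K'$ be defined for any $x_0$ by 

\[ K'(x, x') = K(x, x_0) + K(x', x_0) - K(x, x') - K(x_0, x_0) \]

for all $x, x' \in \mathcal{X}$. Then $K$ is NDS iff $K'$ is PDS. 


Proof Assume that $K'$ is PDS and define $K$ such that for any $x_0$ we have 
$K(x, x') = K(x, x_0) + K(x_0, x') - K(x_0, x_0) - K'(x, x')$. Then for any $\mathbf{c} \in \mathbb{R}^m$ 
such that $\mathbf{c}^\top \mathbf{1} = 0$ and any set of points $(x_1, \ldots, x_m) \in \mathcal{X}^m$ we have 

\begin{align*}
\sum_{i,j=1}^{m} c_i c_j K(x_i, x_j)
  &= \Big( \sum_{i=1}^{m} c_i K(x_i, x_0) \Big) \Big( \sum_{j=1}^{m} c_j \Big) + \Big( \sum_{i=1}^{m} c_i \Big) \Big( \sum_{j=1}^{m} c_j K(x_0, x_j) \Big) \\
  &\quad - \Big( \sum_{i=1}^{m} c_i \Big)^2 K(x_0, x_0) - \sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j) \\
  &= - \sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j) \le 0,
\end{align*}

which proves $K$ is NDS. 

Now, assume $K$ is NDS and define $K'$ for any $x_0$ as above. Then, for any $\mathbf{c} \in \mathbb{R}^m$, 
we can define $c_0 = -\mathbf{c}^\top \mathbf{1}$ and the following holds by the NDS property for any points 
$(x_1, \ldots, x_m) \in \mathcal{X}^m$ as well as $x_0$ defined previously: $\sum_{i,j=0}^{m} c_i c_j K(x_i, x_j) \le 0$. This 
implies that 

\begin{align*}
& \Big( \sum_{i=0}^{m} c_i K(x_i, x_0) \Big) \Big( \sum_{j=0}^{m} c_j \Big) + \Big( \sum_{i=0}^{m} c_i \Big) \Big( \sum_{j=0}^{m} c_j K(x_0, x_j) \Big) \\
& \quad - \Big( \sum_{i=0}^{m} c_i \Big)^2 K(x_0, x_0) - \sum_{i,j=0}^{m} c_i c_j K'(x_i, x_j) = - \sum_{i,j=0}^{m} c_i c_j K'(x_i, x_j) \le 0,
\end{align*}

which implies $\sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j) \ge -2 c_0 \sum_{i=1}^{m} c_i K'(x_i, x_0) - c_0^2 K'(x_0, x_0) = 0$. 
The equality holds since $\forall x \in \mathcal{X}$, $K'(x, x_0) = 0$. □ 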
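
The construction of theorem 5.6 can be illustrated numerically with the squared distance of example 5.4. In that case $K'(x, x') = 2 (x - x_0) \cdot (x' - x_0)$, a Gram matrix, so $K'$ is visibly PDS. The sketch below uses synthetic points and hypothetical variable names, and checks both the NDS property of $K$ (restricted to vectors $\mathbf{c}$ with $\mathbf{1}^\top \mathbf{c} = 0$) and the PDS property of $K'$ up to numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 3))           # sample points x_1, ..., x_m (illustrative data)
x0 = rng.normal(size=3)               # arbitrary reference point x_0

def sqdist(A, B):
    """Matrix of squared Euclidean distances between the rows of A and the rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

K = sqdist(X, X)                      # NDS kernel matrix (example 5.4)
k0 = sqdist(X, x0[None, :])[:, 0]     # K(x_i, x_0); here K(x_0, x_0) = 0

# K'(x_i, x_j) = K(x_i, x_0) + K(x_j, x_0) - K(x_i, x_j) - K(x_0, x_0)
K_prime = k0[:, None] + k0[None, :] - K

# NDS check: c^T K c <= 0 whenever 1^T c = 0, tested by projecting onto that subspace.
m = len(X)
P = np.eye(m) - np.ones((m, m)) / m
max_eig_projected = np.linalg.eigvalsh(P @ K @ P).max()
min_eig_prime = np.linalg.eigvalsh(K_prime).min()
print(max_eig_projected, min_eig_prime)
```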


This theorem is useful in showing other connections, such as the following theorems, 
which are left as exercises (see exercises 5.14 and 5.15). 


Theorem 5.7 
Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric kernel. Then, $K$ is NDS iff $\exp(-tK)$ is a PDS 
kernel for all $t > 0$. 


The theorem provides another proof that Gaussian kernels are PDS: as seen earlier 
(Example 5.4), the squared distance $(x, x') \mapsto \|x - x'\|^2$ in $\mathbb{R}^N$ is NDS, thus 
$(x, x') \mapsto \exp(-t \|x - x'\|^2)$ is PDS for all $t > 0$. 

Theorem 5.8 

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be an NDS kernel such that for all $x, x' \in \mathcal{X}$, $K(x, x') = 0$ iff 
$x = x'$. Then, there exists a Hilbert space $\mathbb{H}$ and a mapping $\Phi\colon \mathcal{X} \to \mathbb{H}$ such that 
for all $x, x' \in \mathcal{X}$, 

\[ K(x, x') = \|\Phi(x) - \Phi(x')\|^2. \]

Thus, under the hypothesis of the theorem, $\sqrt{K}$ defines a metric. 


This theorem can be used to show that the kernel $(x, x') \mapsto \exp(-|x - x'|^p)$ in $\mathbb{R}$ 
is not PDS for $p > 2$. Otherwise, for any $t > 0$, $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ and $\mathbf{c} \in \mathbb{R}^{m \times 1}$, 
we would have: 

\[ \sum_{i,j=1}^{m} c_i c_j e^{-t |x_i - x_j|^p} = \sum_{i,j=1}^{m} c_i c_j e^{-|t^{1/p} x_i - t^{1/p} x_j|^p} \ge 0. \]

By theorem 5.7, this would imply that $(x, x') \mapsto |x - x'|^p$ is NDS for $p > 2$, which can be proven 
(via theorem 5.8) not to be valid. 
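
A direct numerical counterexample makes the claim concrete. The sketch below builds the matrix $[\exp(-|x_i - x_j|^p)]_{ij}$ for three equally spaced points on the real line and reports its smallest eigenvalue; the points and the helper name are illustrative choices. For $p = 4$, the vector $\mathbf{c} = (1, -2, 1)$ already gives a negative quadratic form, while for $p = 1$ and $p = 2$ the matrix is positive semidefinite.

```python
import numpy as np

def min_eig_exp_power(points, p):
    """Smallest eigenvalue of the matrix [exp(-|x_i - x_j|^p)] for points on the real line."""
    x = np.asarray(points, dtype=float)
    K = np.exp(-np.abs(x[:, None] - x[None, :]) ** p)
    return np.linalg.eigvalsh(K).min()

points = [0.0, 0.5, 1.0]              # three equally spaced points (illustrative choice)
for p in (1, 2, 4):
    print(p, min_eig_exp_power(points, p))
```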


5.5 Sequence kernels 


The examples given in the previous sections, including the commonly used 
polynomial or Gaussian kernels, were all for PDS kernels over vector spaces. In many 
learning tasks found in practice, the input space $\mathcal{X}$ is not a vector space. The 
examples to classify in practice could be protein sequences, images, graphs, parse 
trees, finite automata, or other discrete structures which may not be directly given 
as vectors. PDS kernels provide a method for extending algorithms such as SVMs 
originally designed for a vectorial space to the classification of such objects. But, 
how can we define PDS kernels for these structures? 

This section will focus on the specific case of sequence kernels, that is, kernels 
for sequences or strings. PDS kernels can be defined for other discrete structures 
in somewhat similar ways. Sequence kernels are particularly relevant to learning 
algorithms applied to computational biology or natural language processing, which 
are both important applications. 

How can we define PDS kernels for sequences, which are similarity measures for 
sequences? One idea consists of declaring two sequences, e.g., two documents or 
two biosequences, as similar when they share common substrings or subsequences. 
One example could be the kernel between two sequences defined by the sum 
of the product of the counts of their common substrings. But which substrings 
should be used in that definition? Most likely, we would need some flexibility in 
the definition of the matching substrings. For computational biology applications, 
for example, the match could be imperfect. Thus, we may need to consider some 
number of mismatches, possibly gaps, or wildcards. More generally, we might need 
to allow various substitutions and might wish to assign different weights to common 
substrings to emphasize some matching substrings and deemphasize others. 

As can be seen from this discussion, there are many different possibilities and 
we need a general framework for defining such kernels. In the following, we will 
introduce a general framework for sequence kernels, rational kernels, which will 
include all the kernels considered in this discussion. We will also describe a general 
and efficient algorithm for their computation and will illustrate them with some 
examples. 

The definition of these kernels relies on that of weighted transducers. Thus, we 
start with the definition of these devices as well as some relevant algorithms. 


5.5.1 Weighted transducers 


Sequence kernels can be effectively represented and computed using weighted 
transducers. In the following definition, let $\Sigma$ denote a finite input alphabet, $\Delta$ a finite 
output alphabet, and $\epsilon$ the empty string or null label, whose concatenation with 
any string leaves it unchanged. 


Figure 5.3 Example of weighted transducer. 


Definition 5.4 

A weighted transducer $T$ is a 7-tuple $T = (\Sigma, \Delta, Q, I, F, E, \rho)$ where $\Sigma$ is a finite 
input alphabet, $\Delta$ a finite output alphabet, $Q$ is a finite set of states, $I \subseteq Q$ the 
set of initial states, $F \subseteq Q$ the set of final states, $E$ a finite multiset of transitions 
elements of $Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{R} \times Q$, and $\rho\colon F \to \mathbb{R}$ a final weight function 
mapping $F$ to $\mathbb{R}$. The size of transducer $T$ is the sum of its number of states and 
transitions and is denoted by $|T|$.² 


Thus, weighted transducers are finite automata in which each transition is labeled 
with both an input and an output label and carries some real-valued weight. 
Figure 5.3 shows an example of a weighted finite-state transducer. In this figure, 
the input and output labels of a transition are separated by a colon delimiter, and 
the weight is indicated after the slash separator. The initial states are represented 
by a bold circle and final states by double circles. The final weight $\rho[q]$ at a final 
state $q$ is displayed after the slash. 

The input label of a path $\pi$ is a string element of $\Sigma^*$ obtained by concatenating 
input labels along $\pi$. Similarly, the output label of a path $\pi$ is obtained by 
concatenating output labels along $\pi$. A path from an initial state to a final state is 
an accepting path. The weight of an accepting path is obtained by multiplying the 
weights of its constituent transitions and the weight of the final state of the path. 

A weighted transducer defines a mapping from $\Sigma^* \times \Delta^*$ to $\mathbb{R}$. The weight 
associated by a weighted transducer $T$ to a pair of strings $(x, y) \in \Sigma^* \times \Delta^*$ is 
denoted by $T(x, y)$ and is obtained by summing the weights of all accepting paths 
with input label $x$ and output label $y$. For example, the transducer of figure 5.3 
associates to the pair $(aab, baa)$ the weight $3 \times 1 \times 4 \times 2 + 3 \times 2 \times 3 \times 2$, since there 
is a path with input label $aab$ and output label $baa$ and weight $3 \times 1 \times 4 \times 2$, and 
another one with weight $3 \times 2 \times 3 \times 2$. 


2. A multiset in the definition of the transitions is used to allow for the presence of several 
transitions from a state p to a state q with the same input and output label, and even the 
same weight, which may occur as a result of various operations. 


The sum of the weights of all accepting paths of an acyclic transducer, that 
is a transducer $T$ with no cycle, can be computed in linear time, that is $O(|T|)$, 
using a general shortest-distance or forward-backward algorithm. These are simple 
algorithms, but a detailed description would require too much of a digression from 
the main topic of this chapter. 
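
To make the definition of $T(x, y)$ concrete, here is a minimal sketch that sums the weights of all accepting paths with a given input and output label by memoized recursion over (state, input position, output position). The transducer below is a small hypothetical example and is not the one of figure 5.3; the tuple encoding of transitions and the function name are assumptions of this sketch, which also assumes that there is no cycle of transitions labeled with $\epsilon$ on both sides.

```python
from functools import lru_cache

# A transition is (source, input_label, output_label, weight, destination);
# the empty string "" plays the role of epsilon.
TRANSITIONS = [
    (0, "a", "b", 3.0, 0),
    (0, "a", "b", 1.0, 1),
    (0, "b", "a", 4.0, 1),
    (1, "b", "a", 2.0, 1),
]
INITIAL = {0}
FINAL = {1: 2.0}                      # final state -> final weight rho

def transducer_weight(transitions, initial, final, x, y):
    """T(x, y): sum over accepting paths with input x and output y of the product
    of the transition weights and the final weight."""
    @lru_cache(maxsize=None)
    def from_state(q, i, j):
        # Weight of stopping here, if both strings are consumed and q is final.
        total = final.get(q, 0.0) if (i == len(x) and j == len(y)) else 0.0
        for (src, a, b, w, dst) in transitions:
            if src == q and x.startswith(a, i) and y.startswith(b, j):
                total += w * from_state(dst, i + len(a), j + len(b))
        return total

    return sum(from_state(q, 0, 0) for q in initial)

print(transducer_weight(TRANSITIONS, INITIAL, FINAL, "ab", "ba"))
```

For the pair $(ab, ba)$ there are two accepting paths, with weights $3 \times 4 \times 2$ and $1 \times 2 \times 2$, so the value printed is their sum.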


Composition An important operation for weighted transducers is composition, 
which can be used to combine two or more weighted transducers to form more 
complex weighted transducers. As we shall see, this operation is useful for the 
creation and computation of sequence kernels. Its definition follows that of 
composition of relations. Given two weighted transducers $T_1 = (\Sigma, \Delta, Q_1, I_1, F_1, E_1, \rho_1)$ 
and $T_2 = (\Delta, \Omega, Q_2, I_2, F_2, E_2, \rho_2)$, the result of the composition of $T_1$ and $T_2$ is a 
weighted transducer denoted by $T_1 \circ T_2$ and defined for all $x \in \Sigma^*$ and $y \in \Omega^*$ by 

\[ (T_1 \circ T_2)(x, y) = \sum_{z \in \Delta^*} T_1(x, z) \cdot T_2(z, y), \qquad (5.19) \]

where the sum runs over all strings $z$ over the alphabet $\Delta$. Thus, composition is 
similar to matrix multiplication with infinite matrices. 

There exists a general and efficient algorithm to compute the composition of two 
weighted transducers. In the absence of $\epsilon$s on the output side of $T_1$ and the input 
side of $T_2$, the states of $T_1 \circ T_2 = (\Sigma, \Omega, Q, I, F, E, \rho)$ can be identified with pairs 
made of a state of $T_1$ and a state of $T_2$, $Q \subseteq Q_1 \times Q_2$. Initial states are those 
obtained by pairing initial states of the original transducers, $I = I_1 \times I_2$, and 
similarly final states are defined by $F = Q \cap (F_1 \times F_2)$. The final weight at a state 
$(q_1, q_2) \in F_1 \times F_2$ is $\rho(q) = \rho_1(q_1) \rho_2(q_2)$, that is the product of the final weights at 
$q_1$ and $q_2$. Transitions are obtained by matching a transition of $T_1$ with a transition of $T_2$ 
whose input label equals the output label of the former: 

\[ E = \biguplus_{\substack{(q_1, a, b, w_1, q_2) \in E_1 \\ (q_1', b, c, w_2, q_2') \in E_2}} \Big\{ \big( (q_1, q_1'), a, c, w_1 \otimes w_2, (q_2, q_2') \big) \Big\}. \]

Here, $\uplus$ denotes the standard join operation of multisets as in $\{1, 2\} \uplus \{1, 3\} = 
\{1, 1, 2, 3\}$, to preserve the multiplicity of the transitions. 

In the worst case, all transitions of $T_1$ leaving a state $q_1$ match all those of $T_2$ 
leaving state $q_1'$, thus the space and time complexity of composition is quadratic: 
$O(|T_1| |T_2|)$. In practice, such cases are rare and composition is very efficient. 
Figure 5.4 illustrates the algorithm in a particular case. 


Figure 5.4 (a) Weighted transducer $T_1$. (b) Weighted transducer $T_2$. (c) Result 
of composition of $T_1$ and $T_2$, $T_1 \circ T_2$. Some states might be constructed during the 
execution of the algorithm that are not co-accessible, that is, they do not admit a 
path to a final state, e.g., (3, 2). Such states and the related transitions (in red) can 
be removed by a trimming (or connection) algorithm in linear time. 
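
The $\epsilon$-free composition just described can be sketched in a few lines. This is a simplified illustration under stated assumptions: transducers are encoded as triples (initial states, final weights, list of transitions), a representation chosen for this example; weights are real numbers combined by ordinary multiplication; all state pairs are built, so no trimming of inaccessible or non-co-accessible states is performed. The two toy transducers are hypothetical. The result is compared with definition (5.19) by brute force.

```python
from functools import lru_cache
from itertools import product

def compose(t1, t2):
    """Epsilon-free composition: pair the states and match each output label of t1
    with an equal input label of t2. A list keeps the multiplicity of transitions."""
    init1, final1, trans1 = t1
    init2, final2, trans2 = t2
    initial = {(p, q) for p in init1 for q in init2}
    final = {(p, q): w1 * w2 for p, w1 in final1.items() for q, w2 in final2.items()}
    transitions = [
        ((p1, p2), a, c, w1 * w2, (q1, q2))
        for (p1, a, b, w1, q1) in trans1
        for (p2, b2, c, w2, q2) in trans2
        if b == b2
    ]
    return initial, final, transitions

def weight(t, x, y):
    """Sum of the weights of all accepting paths of t with input x and output y."""
    initial, final, transitions = t

    @lru_cache(maxsize=None)
    def from_state(q, i, j):
        total = final.get(q, 0.0) if (i == len(x) and j == len(y)) else 0.0
        for (src, a, b, w, dst) in transitions:
            if src == q and x.startswith(a, i) and y.startswith(b, j):
                total += w * from_state(dst, i + len(a), j + len(b))
        return total

    return sum(from_state(q, 0, 0) for q in initial)

# Two hypothetical one-state, epsilon-free transducers over {a, b}.
T1 = ({0}, {0: 1.0}, [(0, "a", "b", 2.0, 0), (0, "b", "a", 3.0, 0)])
T2 = ({0}, {0: 1.0}, [(0, "a", "a", 1.0, 0), (0, "b", "a", 5.0, 0), (0, "b", "b", 1.0, 0)])
T12 = compose(T1, T2)

def brute_force(x, y):
    """Definition (5.19); epsilon-free transducers preserve length, so only
    intermediate strings z with |z| = |x| can contribute."""
    return sum(weight(T1, x, "".join(z)) * weight(T2, "".join(z), y)
               for z in product("ab", repeat=len(x)))

print(weight(T12, "ab", "aa"), brute_force("ab", "aa"))
```

A practical implementation would only create the state pairs reachable from the initial pairs, which is what keeps composition efficient on sparse transducers.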


As illustrated by figure 5.5, when $T_1$ admits output $\epsilon$ labels or $T_2$ input $\epsilon$ labels, 
the algorithm just described may create redundant $\epsilon$-paths, which would lead to 
an incorrect result. The weight of the matching paths of the original transducers 
would be counted $p$ times, where $p$ is the number of redundant paths in the result 
of composition. To avoid this problem, all but one $\epsilon$-path must be filtered out 
of the composite transducer. Figure 5.5 indicates in boldface one possible choice for 
that path, which in this case is the shortest. Remarkably, that filtering mechanism 
itself can be encoded as a finite-state transducer $F$ (figure 5.5b). 

To apply that filter, we need to first augment $T_1$ and $T_2$ with auxiliary symbols 
that make the semantics of $\epsilon$ explicit: let $\tilde{T}_1$ ($\tilde{T}_2$) be the weighted transducer obtained 
from $T_1$ (respectively $T_2$) by replacing the output (respectively input) $\epsilon$ labels with 
$\epsilon_2$ (respectively $\epsilon_1$) as illustrated by figure 5.5. Thus, matching with the symbol $\epsilon_1$ 
corresponds to remaining at the same state of $T_1$ and taking a transition of $T_2$ with 
input $\epsilon$. $\epsilon_2$ can be described in a symmetric way. The filter transducer $F$ disallows a 
matching $(\epsilon_2, \epsilon_2)$ immediately after $(\epsilon_1, \epsilon_1)$ since this can be done instead via $(\epsilon_2, \epsilon_1)$. 
By symmetry, it also disallows a matching $(\epsilon_1, \epsilon_1)$ immediately after $(\epsilon_2, \epsilon_2)$. In the 
same way, a matching $(\epsilon_1, \epsilon_1)$ immediately followed by $(\epsilon_2, \epsilon_1)$ is not permitted 
by the filter $F$ since a path via the matchings $(\epsilon_2, \epsilon_1)(\epsilon_1, \epsilon_1)$ is possible. Similarly, 
$(\epsilon_2, \epsilon_2)(\epsilon_2, \epsilon_1)$ is ruled out. It is not hard to verify that the filter transducer $F$ is 
precisely a finite automaton over pairs accepting the complement of the language 

\[ L = \sigma^* \big( (\epsilon_1, \epsilon_1)(\epsilon_2, \epsilon_2) + (\epsilon_2, \epsilon_2)(\epsilon_1, \epsilon_1) + (\epsilon_1, \epsilon_1)(\epsilon_2, \epsilon_1) + (\epsilon_2, \epsilon_2)(\epsilon_2, \epsilon_1) \big) \sigma^*, \]

where $\sigma = \{(\epsilon_1, \epsilon_1), (\epsilon_2, \epsilon_2), (\epsilon_2, \epsilon_1), x\}$. Thus, the filter $F$ guarantees that exactly 
one $\epsilon$-path is allowed in the composition of each pair of $\epsilon$ sequences. To obtain the correct 
result of composition, it suffices then to use the $\epsilon$-free composition algorithm already 
described and compute 

\[ \tilde{T}_1 \circ F \circ \tilde{T}_2. \qquad (5.20) \]

Indeed, the two compositions in $\tilde{T}_1 \circ F \circ \tilde{T}_2$ no longer involve $\epsilon$s. Since the size of 
the filter transducer $F$ is constant, the complexity of general composition is the 
same as that of $\epsilon$-free composition, that is $O(|T_1| |T_2|)$. In practice, the augmented 
transducers $\tilde{T}_1$ and $\tilde{T}_2$ are not explicitly constructed, instead the presence of the 
auxiliary symbols is simulated. Further filter optimizations help limit the number of 
non-coaccessible states created, for example, by examining more carefully the case 
of states with only outgoing non-$\epsilon$-transitions or only outgoing $\epsilon$-transitions. 


Figure 5.5 Redundant $\epsilon$-paths in composition. All transition and final weights are 
equal to one. (a) A straightforward generalization of the $\epsilon$-free case would generate 
all the paths from (1, 1) to (3, 2) when composing $T_1$ and $T_2$ and produce an incorrect 
result in non-idempotent semirings. (b) Filter transducer $F$. The shorthand $x$ is 
used to represent an element of $\Sigma$. 


5.5.2 Rational kernels 


The following establishes a general framework for the definition of sequence kernels. 


Definition 5.5 Rational kernels 
A kernel $K\colon \Sigma^* \times \Sigma^* \to \mathbb{R}$ is said to be rational if it coincides with the mapping 
defined by some weighted transducer $U$: $\forall x, y \in \Sigma^*$, $K(x, y) = U(x, y)$. 


Note that we could have instead adopted a more general definition: instead of using 
weighted transducers, we could have used more powerful sequence mappings such 
as algebraic transductions, which are the functional counterparts of context-free 
languages, or even more powerful ones. However, an essential need for kernels is 
an efficient computation, and more complex definitions would lead to substantially 
more costly computational complexities for kernel computation. For rational kernels, 
there exists a general and efficient computation algorithm. 


Computation We will assume that the transducer $U$ defining a rational kernel 
$K$ does not admit any $\epsilon$-cycle with non-zero weight, otherwise the kernel value is 
infinite for all pairs. For any sequence $x$, let $T_x$ denote a weighted transducer with 
just one accepting path whose input and output labels are both $x$ and its weight 
equal to one. $T_x$ can be straightforwardly constructed from $x$ in linear time $O(|x|)$. 
Then, for any $x, y \in \Sigma^*$, $U(x, y)$ can be computed by the following two steps: 


1. Compute $V = T_x \circ U \circ T_y$ using the composition algorithm in time $O(|U| |T_x| |T_y|)$. 


2. Compute the sum of the weights of all accepting paths of $V$ using a general 
shortest-distance algorithm in time $O(|V|)$. 


By definition of composition, $V$ is a weighted transducer whose accepting paths are 
precisely those accepting paths of $U$ that have input label $x$ and output label $y$. 
The second step computes the sum of the weights of these paths, that is, exactly 
$U(x, y)$. Since $U$ admits no $\epsilon$-cycle, $V$ is acyclic, and this step can be performed in 
linear time. The overall complexity of the algorithm for computing $U(x, y)$ is then 
in $O(|U| |T_x| |T_y|)$. Since $U$ is fixed for a rational kernel $K$ and $|T_x| = O(|x|)$ for any 
$x$, this shows that the kernel values can be obtained in quadratic time $O(|x| |y|)$. 
For some specific weighted transducers $U$, the computation can be more efficient, 
for example in $O(|x| + |y|)$ (see exercise 5.17). 
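
The two steps above can be sketched as follows for the simplest case of an $\epsilon$-free transducer $U$. Everything named here is an assumption of the sketch: the triple encoding (initial states, final weights, transitions), the function names, and the toy transducer $U$, which assigns weight $\lambda$ to each position where two equal-length strings over $\{a, b\}$ differ. Handling $\epsilon$ labels would require the filtered composition described earlier.

```python
from functools import lru_cache

def linear_transducer(x):
    """T_x: a single accepting path with input and output label x and weight one."""
    return {0}, {len(x): 1.0}, [(i, c, c, 1.0, i + 1) for i, c in enumerate(x)]

def compose(t1, t2):
    """Epsilon-free composition on (initial, final, transitions) triples."""
    init1, final1, trans1 = t1
    init2, final2, trans2 = t2
    initial = {(p, q) for p in init1 for q in init2}
    final = {(p, q): w1 * w2 for p, w1 in final1.items() for q, w2 in final2.items()}
    transitions = [((p1, p2), a, c, w1 * w2, (q1, q2))
                   for (p1, a, b, w1, q1) in trans1
                   for (p2, b2, c, w2, q2) in trans2 if b == b2]
    return initial, final, transitions

def sum_of_accepting_paths(t):
    """Step 2: total weight of all accepting paths of an acyclic transducer."""
    initial, final, transitions = t
    outgoing = {}
    for (p, _a, _b, w, q) in transitions:
        outgoing.setdefault(p, []).append((w, q))

    @lru_cache(maxsize=None)
    def to_final(q):
        return final.get(q, 0.0) + sum(w * to_final(r) for w, r in outgoing.get(q, []))

    return sum(to_final(q) for q in initial)

def rational_kernel(U, x, y):
    """Step 1 then step 2: V = T_x o U o T_y, then the sum over its accepting paths."""
    V = compose(compose(linear_transducer(x), U), linear_transducer(y))
    return sum_of_accepting_paths(V)

LAM = 0.5                             # penalty per mismatching position (hypothetical)
U = ({0}, {0: 1.0}, [(0, "a", "a", 1.0, 0), (0, "b", "b", 1.0, 0),
                     (0, "a", "b", LAM, 0), (0, "b", "a", LAM, 0)])

print(rational_kernel(U, "abba", "abab"))
```

Since every transition of $V$ advances one position in both $x$ and $y$, $V$ is acyclic and the memoized recursion of the second step visits each state once.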


PDS rational kernels For any transducer $T$, let $T^{-1}$ denote the inverse of $T$, 
that is the transducer obtained from $T$ by swapping the input and output labels of 
every transition. For all $x, y$, we have $T^{-1}(x, y) = T(y, x)$. The following theorem 
gives a general method for constructing a PDS rational kernel from an arbitrary 
weighted transducer. 


Theorem 5.9 
For any weighted transducer $T = (\Sigma, \Delta, Q, I, F, E, \rho)$, the function $K = T \circ T^{-1}$ is 
a PDS rational kernel. 


Proof By definition of composition and the inverse operation, for all $x, y \in \Sigma^*$, 

\[ K(x, y) = \sum_{z \in \Delta^*} T(x, z)\, T(y, z). \]

$K$ is the pointwise limit of the kernel sequence $(K_n)_{n \ge 0}$ defined by: 

\[ \forall n \in \mathbb{N}, \forall x, y \in \Sigma^*, \quad K_n(x, y) = \sum_{|z| \le n} T(x, z)\, T(y, z), \]

where the sum runs over all sequences in $\Delta^*$ of length at most $n$. $K_n$ is PDS 
since its corresponding kernel matrix $\mathbf{K}_n$ for any sample $(x_1, \ldots, x_m)$ is SPSD. 
This can be seen from the fact that $\mathbf{K}_n$ can be written as $\mathbf{K}_n = \mathbf{A}\mathbf{A}^\top$ with 
$\mathbf{A} = (T(x_i, z_j))_{i \in [1, m], j \in [1, N]}$, where $z_1, \ldots, z_N$ is some arbitrary enumeration of 
the set of strings in $\Delta^*$ with length at most $n$. Thus, $K$ is PDS as the pointwise 
limit of the sequence of PDS kernels $(K_n)_{n \in \mathbb{N}}$. □ 


The sequence kernels commonly used in computational biology, natural language 
processing, computer vision, and other applications are all special instances of 
rational kernels of the form $T \circ T^{-1}$. All of these kernels can be computed efficiently 
using the same general algorithm for the computation of rational kernels presented 
in the previous paragraph. Since the transducer $U = T \circ T^{-1}$ defining such PDS 
rational kernels has a specific form, there are different options for the computation 
of the composition $T_x \circ U \circ T_y$: 

- compute $U = T \circ T^{-1}$ first, then $V = T_x \circ U \circ T_y$; 
- compute $V_1 = T_x \circ T$ and $V_2 = T_y \circ T$ first, then $V = V_1 \circ V_2^{-1}$; 
- compute first $V_1 = T_x \circ T$, then $V_2 = V_1 \circ T^{-1}$, then $V = V_2 \circ T_y$, or the similar 
series of operations with $x$ and $y$ permuted. 

All of these methods lead to the same result after computation of the sum of the 
weights of all accepting paths, and they all have the same worst-case complexity. 
However, in practice, due to the sparsity of intermediate compositions, there may 
be substantial differences between their time and space computational costs. An 
alternative method based on an $n$-way composition can further lead to significantly 
more efficient computations. 


Figure 5.6 (a) Transducer $T_{\text{bigram}}$ defining the bigram kernel $T_{\text{bigram}} \circ T_{\text{bigram}}^{-1}$ for $\Sigma = 
\{a, b\}$. (b) Transducer $T_{\text{gappy\_bigram}}$ defining the gappy bigram kernel $T_{\text{gappy\_bigram}} \circ 
T_{\text{gappy\_bigram}}^{-1}$ with gap penalty $\lambda \in (0, 1)$. 


Example 5.5 Bigram and gappy bigram sequence kernels 

Figure 5.6a shows a weighted transducer $T_{\text{bigram}}$ defining a common sequence 
kernel, the bigram sequence kernel, for the specific case of an alphabet reduced 
to $\Sigma = \{a, b\}$. The bigram kernel associates to any two sequences $x$ and $y$ the sum 
of the product of the counts of all bigrams in $x$ and $y$. For any sequence $x \in \Sigma^*$ and 
any bigram $z \in \{aa, ab, ba, bb\}$, $T_{\text{bigram}}(x, z)$ is exactly the number of occurrences 
of the bigram $z$ in $x$. Thus, by definition of composition and the inverse operation, 
$T_{\text{bigram}} \circ T_{\text{bigram}}^{-1}$ computes exactly the bigram kernel. 

Figure 5.6b shows a weighted transducer $T_{\text{gappy\_bigram}}$ defining the so-called gappy 
bigram kernel. The gappy bigram kernel associates to any two sequences $x$ and $y$ 
the sum of the product of the counts of all gappy bigrams in $x$ and $y$ penalized 
by the length of their gaps. Gappy bigrams are sequences of the form $aua$, $aub$, 
$bua$, or $bub$, where $u \in \Sigma^*$ is called the gap. The count of a gappy bigram is 
multiplied by $\lambda^{|u|}$ for some fixed $\lambda \in (0, 1)$ so that gappy bigrams with longer 
gaps contribute less to the definition of the similarity measure. While this definition 
could appear to be somewhat complex, figure 5.6 shows that $T_{\text{gappy\_bigram}}$ can be 
straightforwardly derived from $T_{\text{bigram}}$. The graphical representation of rational 
kernels helps in understanding or modifying their definition. 
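
For comparison with the transducer-based definitions, both kernels of example 5.5 can also be computed directly from explicit feature counts. The sketch below is one reading of the definitions above and not the transducer algorithm itself: the function names are hypothetical, and the gappy feature of a pair of symbols $(a, b)$ is taken to be the sum of $\lambda^{|u|}$ over all occurrences of $aub$ in the sequence.

```python
from collections import Counter

def bigram_features(x):
    """Number of occurrences of each bigram in x."""
    return Counter(x[i:i + 2] for i in range(len(x) - 1))

def gappy_bigram_features(x, lam):
    """For each pair of symbols (a, b), the sum of lam**|u| over occurrences of a u b in x."""
    feats = Counter()
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            feats[x[i] + x[j]] += lam ** (j - i - 1)
    return feats

def dot(f, g):
    """Inner product of two sparse feature vectors."""
    return sum(v * g[k] for k, v in f.items())

def bigram_kernel(x, y):
    return dot(bigram_features(x), bigram_features(y))

def gappy_bigram_kernel(x, y, lam):
    return dot(gappy_bigram_features(x, lam), gappy_bigram_features(y, lam))

print(bigram_kernel("abab", "ab"), gappy_bigram_kernel("aba", "aa", 0.5))
```

Both kernels are inner products of explicit feature vectors, which is another way to see that they are PDS. The double loop makes the gappy features quadratic in $|x|$, so this sketch is meant for small inputs.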


Counting transducers The definition of most sequence kernels is based on the 
counts of some common patterns appearing in the sequences. In the examples 
just examined, these were bigrams or gappy bigrams. There exists a simple and 
general method for constructing a weighted transducer counting the number of 
occurrences of patterns and using them to define PDS rational kernels. Let $X$ be 
a finite automaton representing the set of patterns to count. In the case of bigram 
kernels with $\Sigma = \{a, b\}$, $X$ would be an automaton accepting exactly the set of 
strings $\{aa, ab, ba, bb\}$. Then, the weighted transducer of figure 5.7 can be used to 
compute exactly the number of occurrences of each pattern accepted by $X$. 


Figure 5.7 Counting transducer $T_{\text{count}}$ for $\Sigma = \{a, b\}$. The "transition" $X : X/1$ 
stands for the weighted transducer created from the automaton $X$ by adding to 
each transition an output label identical to the existing label, and by making all 
transition and final weights equal to one. 


Theorem 5.10 
For any $x \in \Sigma^*$ and any sequence $z$ accepted by $X$, $T_{\text{count}}(x, z)$ is the number of 
occurrences of $z$ in $x$. 


Proof Let $x \in \Sigma^*$ be an arbitrary sequence and let $z$ be a sequence accepted by 
$X$. Since all accepting paths of $T_{\text{count}}$ have weight one, $T_{\text{count}}(x, z)$ is equal to the 
number of accepting paths in $T_{\text{count}}$ with input label $x$ and output $z$. 

Now, an accepting path $\pi$ in $T_{\text{count}}$ with input $x$ and output $z$ can be decomposed 
as $\pi = \pi_0\, \pi_{01}\, \pi_1$, where $\pi_0$ is a path through the loops of state 0 with input label 
some prefix $x_0$ of $x$ and output label $\epsilon$, $\pi_{01}$ an accepting path from 0 to 1 with input 
and output labels equal to $z$, and $\pi_1$ a path through the self-loops of state 1 with 
input label a suffix $x_1$ of $x$ and output $\epsilon$. Thus, the number of such paths is exactly 
the number of distinct ways in which we can write sequence $x$ as $x = x_0 z x_1$, which 
is exactly the number of occurrences of $z$ in $x$. □ 


The theorem provides a very general method for constructing PDS rational kernels
$T_{\text{count}} \circ T_{\text{count}}^{-1}$ that are based on counts of some patterns that can be defined
via a finite automaton, or equivalently a regular expression. Figure 5.7 shows the
transducer for the case of an input alphabet reduced to $\Sigma = \{a, b\}$. The general
case can be obtained straightforwardly by augmenting states 0 and 1 with other
self-loops using symbols other than $a$ and $b$. In practice, a lazy evaluation can be
used to avoid the explicit creation of these transitions for all alphabet symbols and
to create them instead on demand, based on the symbols found in the input sequence
$x$. Finally, one can assign different weights to the patterns counted to emphasize
or deemphasize some, as in the case of gappy bigrams. This can be done simply by
changing the transition weights or final weights of the automaton $X$ used in the
definition of $T_{\text{count}}$.
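The counting property of theorem 5.10 can be illustrated without building any transducer. The following is a minimal Python sketch, not the transducer construction itself: it counts the ways of writing $x = x_0 z x_1$ directly, and then forms the inner product of the count vectors, which is the value a kernel of the form $T_{\text{count}} \circ T_{\text{count}}^{-1}$ would assign. The function names and the set `BIGRAMS` are hypothetical, introduced only for this illustration.

```python
def count_occurrences(x, z):
    """Number of ways to write x = x0 + z + x1 (overlapping occurrences count)."""
    return sum(1 for i in range(len(x) - len(z) + 1) if x[i:i + len(z)] == z)

def pattern_counts(x, patterns):
    """Map each pattern z to its number of occurrences in x."""
    return {z: count_occurrences(x, z) for z in patterns}

def count_kernel(x, y, patterns):
    """Inner product of the two count vectors: sum over z of count(x, z) * count(y, z)."""
    cx, cy = pattern_counts(x, patterns), pattern_counts(y, patterns)
    return sum(cx[z] * cy[z] for z in patterns)

# Patterns accepted by the automaton X in the bigram case over {a, b}.
BIGRAMS = {"aa", "ab", "ba", "bb"}

k = count_kernel("abab", "bab", BIGRAMS)
```

Because the kernel value is an inner product of explicit count vectors, it is PDS by construction; the transducer formulation computes the same quantity without enumerating the patterns.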
5.6 Chapter notes


The mathematical theory of PDS kernels in a general setting originated with the 
fundamental work of Mercer [1909] who also proved the equivalence of a condition 
similar to that of theorem 5.1 for continuous kernels with the PDS property. The 
connection between PDS and NDS kernels, in particular theorems 5.8 and 5.7,
are due to Schoenberg [1938]. A systematic treatment of the theory of reproducing 
kernel Hilbert spaces was presented in a long and elegant paper by Aronszajn [1950]. 
For an excellent mathematical presentation of PDS kernels and positive definite 
functions we refer the reader to Berg, Christensen, and Ressel [1984], which is also 
the source of several of the exercises given in this chapter. 

The fact that SVMs could be extended by using PDS kernels was pointed out 
by Boser, Guyon, and Vapnik [1992]. The idea of kernel methods has since been
widely adopted in machine learning and applied in a variety of different tasks
and settings. The following two books are in fact specifically devoted to the study
of kernel methods: Schölkopf and Smola [2002] and Shawe-Taylor and Cristianini
[2004]. The classical representer theorem is due to Kimeldorf and Wahba [1971].
A generalization to non-quadratic cost functions was stated by Wahba [1990]. The
general form presented in this chapter was given by Schölkopf, Herbrich, Smola,
and Williamson [2000].

Rational kernels were introduced by Cortes, Haffner, and Mohri [2004]. A general 
class of kernels, convolution kernels, was earlier introduced by Haussler [1999]. The 
convolution kernels for sequences described by Haussler [1999], as well as the pair- 
HMM string kernels described by Watkins [1999], are special instances of rational 
kernels. Rational kernels can be straightforwardly extended to define kernels for 
finite automata and even weighted automata [Cortes et al., 2004]. Cortes, Mohri, 
and Rostamizadeh [2008b] study the problem of learning rational kernels such as 
those based on counting transducers. 

The composition of weighted transducers and the filter transducers in the presence
of $\epsilon$-paths are described in Pereira and Riley [1997], Mohri, Pereira, and Riley [2005],
and Mohri [2009]. Composition can be further generalized to the N-way composition
of weighted transducers [Allauzen and Mohri, 2009]. N-way composition of three
or more transducers can substantially speed up computation, in particular for PDS
rational kernels of the form $T \circ T^{-1}$. A generic shortest-distance algorithm which can
be used with a large class of semirings and arbitrary queue disciplines is described by
Mohri [2002]. A specific instance of that algorithm can be used to compute the sum
of the weights of all paths as needed for the computation of rational kernels after
composition. For a study of the class of languages linearly separable with rational
kernels, see Cortes, Kontorovich, and Mohri [2007a].
5.7 Exercises


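5.1 Let $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel, and let $\alpha \colon \mathcal{X} \to \mathbb{R}$ be a positive function.
Show that the kernel $K'$ defined for all $x, y \in \mathcal{X}$ by $K'(x, y) = \frac{K(x, y)}{\alpha(x)\alpha(y)}$ is a PDS
kernel.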


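5.2 Show that the following kernels $K$ are PDS:


(a) $K(x, y) = \cos(x - y)$ over $\mathbb{R} \times \mathbb{R}$.

(b) $K(x, y) = \cos(x^2 - y^2)$ over $\mathbb{R} \times \mathbb{R}$.

(c) $K(x, y) = (x + y)^{-1}$ over $(0, +\infty) \times (0, +\infty)$.

(d) $K(\mathbf{x}, \mathbf{x}') = \cos \angle(\mathbf{x}, \mathbf{x}')$ over $\mathbb{R}^n \times \mathbb{R}^n$, where $\angle(\mathbf{x}, \mathbf{x}')$ is the angle between
$\mathbf{x}$ and $\mathbf{x}'$.

(e) $\forall \lambda > 0$, $K(x, x') = \exp(-\lambda [\sin(x' - x)]^2)$ over $\mathbb{R} \times \mathbb{R}$. (Hint: rewrite
$[\sin(x' - x)]^2$ as the square of the norm of the difference of two vectors.)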


5.3 Show that the following kernels $K$ are NDS:


(a) $K(x, y) = [\sin(x - y)]^2$ over $\mathbb{R} \times \mathbb{R}$.

(b) $K(x, y) = \log(x + y)$ over $(0, +\infty) \times (0, +\infty)$.


5.4 Define a difference kernel as $K(x, x') = |x - x'|$ for $x, x' \in \mathbb{R}$. Show that this
kernel is not positive definite symmetric (PDS).


5.5 Is the kernel $K$ defined over $\mathbb{R}^n \times \mathbb{R}^n$ by $K(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^{3/2}$ PDS? Is it NDS?


5.6 Let $H$ be a Hilbert space with the corresponding dot product $\langle \cdot, \cdot \rangle$. Show that
the kernel $K$ defined over $H \times H$ by $K(x, y) = 1 - \langle x, y \rangle$ is negative definite.


5.7 For any $p > 0$, let $K_p$ be the kernel defined over $\mathbb{R}_+ \times \mathbb{R}_+$ by

$$K_p(x, y) = e^{-(x + y)^p}. \tag{5.21}$$

Show that $K_p$ is positive definite symmetric (PDS) iff $p \le 1$. (Hint: you can use the
fact that if $K$ is NDS, then for any $0 < \alpha \le 1$, $K^\alpha$ is also NDS.)


5.8 Explicit mappings. 


(a) Denote a data set $x_1, \ldots, x_m$ and a kernel $K(x_i, x_j)$ with a Gram matrix
$\mathbf{K}$. Assuming $\mathbf{K}$ is positive semidefinite, give a map $\Phi(\cdot)$ such that
$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$.

(b) Show the converse of the previous statement, i.e., if there exists a mapping
$\Phi(x)$ from input space to some Hilbert space, then the corresponding matrix
$\mathbf{K}$ is positive semidefinite.


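5.9 Explicit polynomial kernel mapping. Let $K$ be a polynomial kernel of degree $d$,
i.e., $K \colon \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$, $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + c)^d$, with $c > 0$. Show that the dimension
of the feature space associated to $K$ is

$$\binom{N + d}{d}. \tag{5.22}$$

Write $K$ in terms of kernels $k_i \colon (\mathbf{x}, \mathbf{x}') \mapsto (\mathbf{x} \cdot \mathbf{x}')^i$, $i \in [0, d]$. What is the weight
assigned to each $k_i$ in that expression? How does it vary as a function of $c$?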


5.10 High-dimensional mapping. Let $\Phi \colon \mathcal{X} \to H$ be a feature mapping such that
the dimension $N$ of $H$ is very large and let $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel defined
by

$$K(x, x') = \mathop{\mathrm{E}}_{i \sim D}\big[[\Phi(x)]_i [\Phi(x')]_i\big], \tag{5.23}$$

where $[\Phi(x)]_i$ is the $i$th component of $\Phi(x)$ (and similarly for $\Phi(x')$) and where
$D$ is a distribution over the indices $i$. We shall assume that $|[\Phi(x)]_i| \le R$ for all
$x \in \mathcal{X}$ and $i \in [1, N]$. Suppose that the only method available to compute $K(x, x')$
involved direct computation of the inner product (5.23), which would require $O(N)$
time. Alternatively, an approximation can be computed based on random selection
of a subset $I$ of the $N$ components of $\Phi(x)$ and $\Phi(x')$ according to $D$, that is:

$$K'(x, x') = \frac{1}{n} \sum_{i \in I} [\Phi(x)]_i [\Phi(x')]_i, \tag{5.24}$$

where $|I| = n$.

(a) Fix $x$ and $x'$ in $\mathcal{X}$. Prove that

$$\Pr_{I \sim D^n}\big[|K(x, x') - K'(x, x')| > \epsilon\big] \le 2 e^{-\frac{n \epsilon^2}{2 R^4}}. \tag{5.25}$$

(Hint: use McDiarmid’s inequality).

(b) Let $\mathbf{K}$ and $\mathbf{K}'$ be the kernel matrices associated to $K$ and $K'$. Show
that for any $\epsilon, \delta > 0$, for $n > \frac{2 R^4}{\epsilon^2} \log \frac{m(m+1)}{\delta}$, with probability at least $1 - \delta$,
$|\mathbf{K}'_{ij} - \mathbf{K}_{ij}| \le \epsilon$ for all $i, j \in [1, m]$.


5.11 Classifier based kernel. Let $S$ be a training sample of size $m$. Assume that
$S$ has been generated according to some probability distribution $D(x, y)$, where
$(x, y) \in \mathcal{X} \times \{-1, +1\}$.


(a) Define the Bayes classifier $h^* \colon \mathcal{X} \to \{-1, +1\}$. Show that the kernel $K^*$
defined by $K^*(x, x') = h^*(x) h^*(x')$ for any $x, x' \in \mathcal{X}$ is positive definite
symmetric. What is the dimension of the natural feature space associated to
$K^*$?

(b) Give the expression of the solution obtained using SVMs with this kernel.
What is the number of support vectors? What is the value of the margin? What
is the generalization error of the solution obtained? Under what condition are
the data linearly separable?

(c) Let $h \colon \mathcal{X} \to \mathbb{R}$ be an arbitrary real-valued function. Under what condition
on $h$ is the kernel $K$ defined by $K(x, x') = h(x) h(x')$, $x, x' \in \mathcal{X}$, positive
definite symmetric?


5.12 Image classification kernel. For $\alpha > 0$, the kernel

$$K_\alpha \colon (\mathbf{x}, \mathbf{x}') \mapsto \sum_{k=1}^{N} \min(|x_k|^\alpha, |x'_k|^\alpha) \tag{5.26}$$

over $\mathbb{R}^N \times \mathbb{R}^N$ is used in image classification. Show that $K_\alpha$ is PDS for all $\alpha > 0$.
To do so, proceed as follows.


(a) Use the fact that $(f, g) \mapsto \int_{0}^{+\infty} f(t) g(t)\, dt$ is an inner product over the set
of measurable functions over $[0, +\infty)$ to show that $(x, x') \mapsto \min(x, x')$ is a
PDS kernel. (Hint: associate an indicator function to $x$ and another one to $x'$.)

(b) Use the result from (a) to first show that $K_1$ is PDS and similarly that $K_\alpha$
with other values of $\alpha$ is also PDS.


5.13 Fraud detection. To prevent fraud, a credit-card company decides to contact 
Professor Villebanque and provides him with a random list of several thousand 
fraudulent and non-fraudulent events. There are many different types of events, 
e.g., transactions of various amounts, changes of address or card-holder information, 
or requests for a new card. Professor Villebanque decides to use SVMs with an 
appropriate kernel to help predict fraudulent events accurately. It is difficult for 
Professor Villebanque to define relevant features for such a diverse set of events. 
However, the risk department of his company has created a complicated method to 
estimate a probability Pr[U] for any event U. Thus, Professor Villebanque decides 
to make use of that information and comes up with the following kernel defined 
over all pairs of events $(U, V)$:

$$K(U, V) = \Pr[U \wedge V] - \Pr[U] \Pr[V]. \tag{5.27}$$


Help Professor Villebanque show that his kernel is positive definite symmetric. 


5.14 Relationship between NDS and PDS kernels. Prove the statement of theorem 5.7.
(Hint: Use the fact that if $K$ is PDS then $\exp(K)$ is also PDS, along with
theorem 5.6.)


5.15 Metrics and Kernels. Let $\mathcal{X}$ be a non-empty set and $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a
negative definite symmetric kernel such that $K(x, x) = 0$ for all $x \in \mathcal{X}$.


(a) Show that there exists a Hilbert space $H$ and a mapping $\Phi(x)$ from $\mathcal{X}$ to
$H$ such that:

$$K(x, x') = \|\Phi(x) - \Phi(x')\|^2.$$

Assume that $K(x, x') = 0 \Rightarrow x = x'$. Use theorem 5.6 to show that $\sqrt{K}$ defines
a metric on $\mathcal{X}$.

(b) Use this result to prove that the kernel $K(x, x') = \exp(-|x - x'|^p)$, $x, x' \in \mathbb{R}$,
is not positive definite for $p > 2$.

(c) The kernel $K(\mathbf{x}, \mathbf{x}') = \tanh(a(\mathbf{x} \cdot \mathbf{x}') + b)$ was shown to be equivalent to a two-layer
neural network when combined with SVMs. Show that $K$ is not positive
definite if $a < 0$ or $b < 0$. What can you conclude about the corresponding
neural network when $a < 0$ or $b < 0$?


5.16 Sequence kernels. Let $X = \{a, c, g, t\}$. To classify DNA sequences using SVMs,
we wish to define a kernel between sequences defined over $X$. We are given a finite
set $I \subset X^*$ of non-coding regions (introns). For $x \in X^*$, denote by $|x|$ the length
of $x$ and by $F(x)$ the set of factors of $x$, i.e., the set of subsequences of $x$ with
contiguous symbols. For any two strings $x, y \in X^*$ define $K(x, y)$ by

$$K(x, y) = \sum_{z \in (F(x) \cap F(y)) - I} \rho^{|z|}, \tag{5.28}$$

where $\rho > 1$ is a real number.


(a) Show that $K$ is a rational kernel and that it is positive definite symmetric.

(b) Give the time and space complexity of the computation of $K(x, y)$ with
respect to the size $s$ of a minimal automaton representing $X^* - I$.

(c) Long common factors between $x$ and $y$ of length greater than or equal to
$n$ are likely to be important coding regions (exons). Modify the kernel $K$ to
assign weight $\rho_2^{|z|}$ to $z$ when $|z| \ge n$, $\rho_1^{|z|}$ otherwise, where $1 < \rho_1 < \rho_2$. Show
that the resulting kernel is still positive definite symmetric.


5.17 n-gram kernel. Show that for all $n \ge 1$, and any n-gram kernel $K_n$, $K_n(x, y)$
can be computed in linear time $O(|x| + |y|)$, for all $x, y \in \Sigma^*$, assuming $n$ and the
alphabet size are constants.


5.18 Mercer’s condition. Let $\mathcal{X} \subset \mathbb{R}^N$ be a compact set and $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a
continuous kernel function. Prove that if $K$ verifies Mercer’s condition (theorem 5.1),
then it is PDS. (Hint: assume that $K$ is not PDS and consider a set $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$
and a column-vector $c \in \mathbb{R}^{m \times 1}$ such that $\sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) < 0$.)


6 Boosting 


Ensemble methods are general techniques in machine learning for combining several 
predictors to create a more accurate one. This chapter studies an important family of 
ensemble methods known as boosting, and more specifically the AdaBoost algorithm. 
This algorithm has been shown to be very effective in practice in some scenarios and 
is based on a rich theoretical analysis. We first introduce AdaBoost, show how it can 
rapidly reduce the empirical error as a function of the number of rounds of boosting, 
and point out its relationship with some known algorithms. Then we present a 
theoretical analysis of its generalization properties based on the VC-dimension of 
its hypothesis set and based on a notion of margin that we will introduce. Much of 
that margin theory can be applied to other similar ensemble algorithms. A
game-theoretic interpretation of AdaBoost further helps analyze its properties. We end
with a discussion of AdaBoost’s benefits and drawbacks.


6.1 Introduction 


It is often difficult, for a non-trivial learning task, to directly devise an accurate 
algorithm satisfying the strong PAC-learning requirements of chapter 2. But, there 
can be more hope for finding simple predictors guaranteed only to perform slightly 
better than random. The following gives a formal definition of such weak learners. 


Definition 6.1 Weak learning
A concept class $C$ is said to be weakly PAC-learnable if there exists an algorithm
$\mathcal{A}$, $\gamma > 0$, and a polynomial function $poly(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$,
for all distributions $D$ on $\mathcal{X}$ and for any target concept $c \in C$, the following holds
for any sample size $m \ge poly(1/\epsilon, 1/\delta, n, size(c))$:

$$\Pr_{S \sim D^m}\Big[R(h_S) \le \frac{1}{2} - \gamma\Big] \ge 1 - \delta. \tag{6.1}$$

When such an algorithm $\mathcal{A}$ exists, it is called a weak learning algorithm for $C$ or a
weak learner. The hypotheses returned by a weak learning algorithm are called base
classifiers.
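ADABOOST($S = ((x_1, y_1), \ldots, (x_m, y_m))$)

 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1(i) \leftarrow \frac{1}{m}$
 3  for $t \leftarrow 1$ to $T$ do
 4      $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$
 5      $\alpha_t \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$
 6      $Z_t \leftarrow 2[\epsilon_t(1 - \epsilon_t)]^{1/2}$    ▷ normalization factor
 7      for $i \leftarrow 1$ to $m$ do
 8          $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
 9  $g \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
10  return $h = \mathrm{sgn}(g)$

Figure 6.1 AdaBoost algorithm for $H \subseteq \{-1, +1\}^{\mathcal{X}}$.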


The key idea behind boosting techniques is to use a weak learning algorithm 
to build a strong learner, that is, an accurate PAC-learning algorithm. To do so, 
boosting techniques use an ensemble method: they combine different base classifiers 
returned by a weak learner to create a more accurate predictor. But which base 
classifiers should be used and how should they be combined? The next section 
addresses these questions by describing in detail one of the most prevalent and 
successful boosting algorithms, AdaBoost. 


6.2 AdaBoost


We denote by $H$ the hypothesis set out of which the base classifiers are selected.
Figure 6.1 gives the pseudocode of AdaBoost in the case where the base classifiers
are functions mapping from $\mathcal{X}$ to $\{-1, +1\}$, thus $H \subseteq \{-1, +1\}^{\mathcal{X}}$.

The algorithm takes as input a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, with
$(x_i, y_i) \in \mathcal{X} \times \{-1, +1\}$ for all $i \in [1, m]$, and maintains a distribution over the
indices $\{1, \ldots, m\}$. Initially (lines 1-2), the distribution is uniform ($D_1$). At each
round of boosting, that is each iteration $t \in [1, T]$ of the loop 3-8, a new base classifier
$h_t \in H$ is selected that minimizes the error on the training sample weighted by the
distribution $D_t$:

$$h_t \in \operatorname*{argmin}_{h \in H} \Pr_{i \sim D_t}[h(x_i) \ne y_i] = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} D_t(i) 1_{h(x_i) \ne y_i}.$$

$Z_t$ is simply a normalization factor to ensure that the weights $D_{t+1}(i)$ sum to one.
The precise reason for the definition of the coefficient $\alpha_t$ will become clear later. For
now, observe that if $\epsilon_t$, the error of the base classifier, is less than 1/2, then $\frac{1 - \epsilon_t}{\epsilon_t} > 1$
and $\alpha_t > 0$. Thus, the new distribution $D_{t+1}$ is defined from $D_t$ by substantially
increasing the weight on $i$ if point $x_i$ is incorrectly classified ($y_i h_t(x_i) < 0$), and, on
the contrary, decreasing it if $x_i$ is correctly classified. This has the effect of focusing
more on the points incorrectly classified at the next round of boosting, less on those
correctly classified by $h_t$.

Figure 6.2 Example of AdaBoost with axis-aligned hyperplanes as base learners.
(a) The top row shows decision boundaries at each boosting round ($t = 1$, $t = 2$, $t = 3$).
The bottom row shows how weights are updated at each round, with incorrectly (resp.,
correctly) classified points given increased (resp., decreased) weights. (b) Visualization
of final classifier, constructed as a linear combination of base learners.
After $T$ rounds of boosting, the classifier returned by AdaBoost is based on the
sign of function $g$, which is a linear combination of the base classifiers $h_t$. The weight
$\alpha_t$ assigned to $h_t$ in that sum is a logarithmic function of the ratio of the accuracy
$1 - \epsilon_t$ and error $\epsilon_t$ of $h_t$. Thus, more accurate base classifiers are assigned a larger
weight in that sum. Figure 6.2 illustrates the AdaBoost algorithm. The size of the
points represents the distribution weight assigned to them at each boosting round.
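The pseudocode of figure 6.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the base classifiers are decision stumps found by exhaustive search, the helper names (`stump_predict`, `best_stump`, `adaboost`, `predict`) and the toy sample are hypothetical, and two choices are specific to this sketch: the loop stops early when no stump has weighted error below 1/2, and the error is clipped away from zero so that $\alpha_t$ stays finite.

```python
import math

def stump_predict(stump, x):
    """A stump (j, theta, s) predicts s if x[j] > theta and -s otherwise."""
    j, theta, s = stump
    return s if x[j] > theta else -s

def best_stump(X, y, D):
    """Exhaustive search for the stump with minimal weighted error under D."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(x[j] for x in X))
        # One threshold below all values, then midpoints of consecutive values.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for theta in thresholds:
            for s in (+1, -1):
                stump = (j, theta, s)
                err = sum(d for d, x, yi in zip(D, X, y) if stump_predict(stump, x) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost(X, y, T):
    """Return the list of (alpha_t, h_t) pairs built by at most T rounds of boosting."""
    m = len(X)
    D = [1.0 / m] * m                            # lines 1-2: uniform D_1
    ensemble = []
    for _ in range(T):                           # line 3
        h, eps = best_stump(X, y, D)             # line 4
        if eps >= 0.5:                           # no edge left: stop (choice of this sketch)
            break
        eps = max(eps, 1e-12)                    # keep alpha finite for a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)  # line 5
        w = [d * math.exp(-alpha * yi * stump_predict(h, x)) for d, x, yi in zip(D, X, y)]
        Z = sum(w)                               # line 6: 2*sqrt(eps*(1-eps)) unless eps was clipped
        D = [wi / Z for wi in w]                 # lines 7-8
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    """Sign of g(x) = sum_t alpha_t h_t(x); a zero value is mapped to -1 here."""
    g = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if g > 0 else -1

# Toy one-dimensional sample that no single stump classifies perfectly.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(X, y, T=3)
train_errors = sum(predict(ensemble, x) != yi for x, yi in zip(X, y))
```

Computing the normalization factor as the sum of the unnormalized weights, rather than through the closed form, keeps the distribution valid even when the error is clipped.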

For any $t \in [1, T]$, we will denote by $g_t$ the linear combination of the base classifiers
after $t$ rounds of boosting: $g_t = \sum_{s=1}^{t} \alpha_s h_s$. In particular, we have $g_T = g$. The
distribution $D_{t+1}$ can be expressed in terms of $g_t$ and the normalization factors $Z_s$,
$s \in [1, t]$, as follows:

$$\forall i \in [1, m], \quad D_{t+1}(i) = \frac{e^{-y_i g_t(x_i)}}{m \prod_{s=1}^{t} Z_s}. \tag{6.2}$$

We will make use of this identity several times in the proofs of the following sections.
It can be shown straightforwardly by repeatedly expanding the definition of the
distribution over the point $x_i$:

$$D_{t+1}(i) = \frac{D_t(i) e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i) e^{-\alpha_{t-1} y_i h_{t-1}(x_i)} e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{m \prod_{s=1}^{t} Z_s}.$$


The AdaBoost algorithm can be generalized in several ways:


■ instead of a hypothesis with minimal weighted error, $h_t$ can be more generally
the base classifier returned by a weak learning algorithm trained on $D_t$;


■ the range of the base classifiers could be $[-1, +1]$, or more generally $\mathbb{R}$. The
coefficients $\alpha_t$ can then be different and may not even admit a closed form. In
general, they are chosen to minimize an upper bound on the empirical error, as
discussed in the next section. Of course, in that general case, the hypotheses $h_t$ are
not binary classifiers, but the sign of their values could indicate the label, and their
magnitude could be interpreted as a measure of confidence.


In the remainder of this section, we will further analyze the properties of
AdaBoost and discuss its typical use in practice.


6.2.1 Bound on the empirical error 


We first show that the empirical error of AdaBoost decreases exponentially fast as 
a function of the number of rounds of boosting. 
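Theorem 6.1
The empirical error of the classifier returned by AdaBoost verifies:

$$\widehat{R}(h) \le \exp\Big[-2 \sum_{t=1}^{T} \Big(\frac{1}{2} - \epsilon_t\Big)^2\Big]. \tag{6.3}$$

Furthermore, if for all $t \in [1, T]$, $\gamma \le \big(\frac{1}{2} - \epsilon_t\big)$, then

$$\widehat{R}(h) \le \exp(-2 \gamma^2 T). \tag{6.4}$$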

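Proof Using the general inequality $1_{u \le 0} \le \exp(-u)$ valid for all $u \in \mathbb{R}$ and
identity 6.2, we can write:

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i g(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i g(x_i)} = \frac{1}{m} \sum_{i=1}^{m} \Big[m \prod_{t=1}^{T} Z_t\Big] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$$

Since, for all $t \in [1, T]$, $Z_t$ is a normalization factor, it can be expressed in terms of
$\epsilon_t$ by:

$$
\begin{aligned}
Z_t = \sum_{i=1}^{m} D_t(i) e^{-\alpha_t y_i h_t(x_i)}
&= \sum_{i \colon y_i h_t(x_i) = +1} D_t(i) e^{-\alpha_t} + \sum_{i \colon y_i h_t(x_i) = -1} D_t(i) e^{\alpha_t} \\
&= (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} \\
&= (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.
\end{aligned}
$$

Thus, the product of the normalization factors can be expressed and upper bounded
as follows:

$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4 \Big(\frac{1}{2} - \epsilon_t\Big)^2} \le \prod_{t=1}^{T} \exp\Big[-2 \Big(\frac{1}{2} - \epsilon_t\Big)^2\Big] = \exp\Big[-2 \sum_{t=1}^{T} \Big(\frac{1}{2} - \epsilon_t\Big)^2\Big],$$

where the inequality follows from the inequality $1 - x \le e^{-x}$ valid for all $x \in \mathbb{R}$.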


Note that the value of $\gamma$, which is known as the edge, and the accuracy of the base
classifiers do not need to be known to the algorithm. The algorithm adapts to their
accuracy and defines a solution based on these values. This is the source of the
extended name of AdaBoost: adaptive boosting.
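The chain of inequalities in the proof of theorem 6.1 can be checked numerically. The short sketch below evaluates the product of normalization factors and the two bounds (6.3) and (6.4) for a list of weighted errors; the error values are hypothetical and the function names are introduced only for this illustration.

```python
import math

def z_product(errors):
    """Product of the normalization factors Z_t = 2 sqrt(eps_t (1 - eps_t))."""
    return math.prod(2 * math.sqrt(e * (1 - e)) for e in errors)

def bound_6_3(errors):
    """Right-hand side of (6.3): exp(-2 sum_t (1/2 - eps_t)^2)."""
    return math.exp(-2 * sum((0.5 - e) ** 2 for e in errors))

def bound_6_4(errors):
    """Right-hand side of (6.4), with gamma the smallest edge over all rounds."""
    gamma = min(0.5 - e for e in errors)
    return math.exp(-2 * gamma ** 2 * len(errors))

errors = [1 / 3, 1 / 4, 1 / 6]   # hypothetical weighted errors eps_t, all below 1/2
chain = (z_product(errors), bound_6_3(errors), bound_6_4(errors))
```

The three values in `chain` are non-decreasing: the product of the $Z_t$ is the tightest quantity, (6.3) relaxes it round by round, and (6.4) further replaces every edge by the worst one.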

The proof of theorem 6.1 reveals several other important properties. First, observe
that $\alpha_t$ is the minimizer of the function $g \colon \alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$. Indeed, $g$ is
convex and differentiable, and setting its derivative to zero yields:

$$g'(\alpha) = -(1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha} = 0 \Leftrightarrow (1 - \epsilon_t) e^{-\alpha} = \epsilon_t e^{\alpha} \Leftrightarrow \alpha = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}. \tag{6.5}$$

Figure 6.3 Visualization of the zero-one loss (blue) and the convex and differentiable
upper bound on the zero-one loss (red) that is optimized by AdaBoost.


Thus, $\alpha_t$ is chosen to minimize $Z_t = g(\alpha_t)$, and in light of the bound $\widehat{R}(h) \le \prod_{t=1}^{T} Z_t$
shown in the proof, these coefficients are selected to minimize an upper bound on
the empirical error. In fact, for base classifiers whose range is $[-1, +1]$ or $\mathbb{R}$, $\alpha_t$
can be chosen in a similar fashion to minimize $Z_t$, and this is the way AdaBoost is
extended to these more general cases.

Observe also that the equality $(1 - \epsilon_t) e^{-\alpha_t} = \epsilon_t e^{\alpha_t}$ just shown in (6.5) implies
that at each iteration, AdaBoost assigns equal distribution mass to correctly and
incorrectly classified instances, since $(1 - \epsilon_t) e^{-\alpha_t}$ is the total distribution assigned to
correctly classified points and $\epsilon_t e^{\alpha_t}$ that of incorrectly classified ones. This may seem
to contradict the fact that AdaBoost increases the weights of incorrectly classified
points and decreases that of others, but there is in fact no inconsistency: the reason
is that the incorrectly classified points always carry less total weight under $D_t$ than
the correctly classified ones, since the base classifier’s accuracy is better than random.
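Both observations can be verified on a small example. The sketch below performs a single reweighting step for a hypothetical distribution and a hypothetical pattern of correct and incorrect predictions (the function `reweight` and all values are assumptions of this illustration), then checks that the misclassified points receive half of the new mass and that the chosen coefficient does minimize $g$.

```python
import math

def reweight(D, correct):
    """One AdaBoost update; correct[i] is True when h_t classifies point i correctly."""
    eps = sum(d for d, c in zip(D, correct) if not c)
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [d * math.exp(-alpha if c else alpha) for d, c in zip(D, correct)]
    Z = sum(w)
    return [wi / Z for wi in w], alpha, Z, eps

def g(a, eps):
    """The function minimized by alpha_t: (1 - eps) e^{-a} + eps e^{a}."""
    return (1 - eps) * math.exp(-a) + eps * math.exp(a)

D = [0.1] * 10                           # hypothetical uniform distribution over 10 points
correct = [True] * 8 + [False] * 2       # hypothetical outcome: eps_t = 0.2
D_next, alpha, Z, eps = reweight(D, correct)
mass_wrong = sum(d for d, c in zip(D_next, correct) if not c)
```

Here the two misclassified points end up with half of the total mass between them, so each of them individually weighs more than any correctly classified point, which is how both statements hold at once.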


6.2.2 Relationship with coordinate descent 


AdaBoost was designed to address a novel theoretical question, that of designing a
strong learning algorithm using a weak learning algorithm. We will show, however,
that it coincides in fact with a very simple and classical algorithm, which consists
of applying a coordinate descent technique to a convex and differentiable objective
function. The objective function $F$ for AdaBoost is defined for all samples
$S = ((x_1, y_1), \ldots, (x_m, y_m))$ and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, $n \ge 1$, by

$$F(\boldsymbol{\alpha}) = \sum_{i=1}^{m} e^{-y_i g_n(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{n} \alpha_t h_t(x_i)}, \tag{6.6}$$

where $g_n = \sum_{t=1}^{n} \alpha_t h_t$. This function is an upper bound on the zero-one loss
function we wish to minimize, as shown in figure 6.3. Let $\mathbf{e}_t$ denote the unit vector
corresponding to the $t$th coordinate in $\mathbb{R}^n$ and let $\boldsymbol{\alpha}_{t-1}$ denote the vector based
on the $(t - 1)$ first coefficients, i.e. $\boldsymbol{\alpha}_{t-1} = (\alpha_1, \ldots, \alpha_{t-1}, 0, \ldots, 0)^\top$ if $t - 1 > 0$,
$\boldsymbol{\alpha}_{t-1} = \mathbf{0}$ otherwise. At each iteration $t \ge 1$, the direction $\mathbf{e}_t$ selected by coordinate
descent is the one minimizing the directional derivative:

$$\mathbf{e}_t = \operatorname*{argmin}_{t} \left. \frac{dF(\boldsymbol{\alpha}_{t-1} + \eta \mathbf{e}_t)}{d\eta} \right|_{\eta = 0}.$$
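Since $F(\boldsymbol{\alpha}_{t-1} + \eta \mathbf{e}_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) - y_i \eta h_t(x_i)}$, the directional derivative
along $\mathbf{e}_t$ can be expressed as follows:

$$
\begin{aligned}
\left. \frac{dF(\boldsymbol{\alpha}_{t-1} + \eta \mathbf{e}_t)}{d\eta} \right|_{\eta = 0}
&= -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Big[-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Big] \\
&= -\sum_{i=1}^{m} y_i h_t(x_i) D_t(i) \Big[m \prod_{s=1}^{t-1} Z_s\Big] \\
&= -\Big[\sum_{i \colon y_i h_t(x_i) = +1} D_t(i) - \sum_{i \colon y_i h_t(x_i) = -1} D_t(i)\Big] \Big[m \prod_{s=1}^{t-1} Z_s\Big] \\
&= -[(1 - \epsilon_t) - \epsilon_t] \Big[m \prod_{s=1}^{t-1} Z_s\Big] = [2 \epsilon_t - 1] \Big[m \prod_{s=1}^{t-1} Z_s\Big].
\end{aligned}
$$

The first equality holds by differentiation and evaluation at $\eta = 0$, and the second
one follows from (6.2). The third equality divides the sample set into points correctly
and incorrectly classified by $h_t$, and the fourth equality uses the definition of $\epsilon_t$. In
view of the final equality, since $m \prod_{s=1}^{t-1} Z_s$ is fixed, the direction $\mathbf{e}_t$ selected by
coordinate descent is the one minimizing $\epsilon_t$, which corresponds exactly to the base
learner $h_t$ selected by AdaBoost.

The step size $\eta$ is identified by setting the derivative to zero in order to minimize
the function in the chosen direction $\mathbf{e}_t$. Thus, using identity 6.2 and the definition
of $\epsilon_t$, we can write:

$$
\begin{aligned}
\frac{dF(\boldsymbol{\alpha}_{t-1} + \eta \mathbf{e}_t)}{d\eta} = 0
&\Leftrightarrow -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Big[-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Big] e^{-\eta y_i h_t(x_i)} = 0 \\
&\Leftrightarrow -\sum_{i=1}^{m} y_i h_t(x_i) D_t(i) \Big[m \prod_{s=1}^{t-1} Z_s\Big] e^{-y_i h_t(x_i) \eta} = 0 \\
&\Leftrightarrow -\sum_{i=1}^{m} y_i h_t(x_i) D_t(i) e^{-y_i h_t(x_i) \eta} = 0 \\
&\Leftrightarrow -[(1 - \epsilon_t) e^{-\eta} - \epsilon_t e^{\eta}] = 0 \\
&\Leftrightarrow \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.
\end{aligned}
$$

This proves that the step size chosen by coordinate descent matches the base
classifier weight $\alpha_t$ of AdaBoost. Thus, coordinate descent applied to $F$ precisely
coincides with the AdaBoost algorithm.

Figure 6.4 Examples of several convex upper bounds on the zero-one loss.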

In light of this relationship, one may wish to consider similar applications of
coordinate descent to other convex and differentiable functions of $\boldsymbol{\alpha}$ upper-bounding
the zero-one loss. In particular, the logistic loss $x \mapsto \log_2(1 + e^{-x})$ is convex and
differentiable and upper bounds the zero-one loss. Figure 6.4 shows other examples
of alternative convex loss functions upper-bounding the zero-one loss. Using the
logistic loss, instead of the exponential loss used by AdaBoost, leads to an algorithm
that coincides with logistic regression.
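That both surrogate losses dominate the zero-one loss is easy to confirm on a grid of values. The following Python sketch does so for the exponential loss and the base-2 logistic loss; the function names and the grid are assumptions of this illustration, and the argument `u` plays the role of the quantity $y\,g(x)$.

```python
import math

def zero_one(u):
    """Zero-one loss as a function of u = y g(x): an error whenever u <= 0."""
    return 1.0 if u <= 0 else 0.0

def exp_loss(u):
    """Exponential loss used by AdaBoost."""
    return math.exp(-u)

def logistic_loss(u):
    """Logistic loss in base 2, so that its value at u = 0 is exactly 1."""
    return math.log2(1 + math.exp(-u))

grid = [k / 10 for k in range(-50, 51)]
dominates = all(
    zero_one(u) <= exp_loss(u) and zero_one(u) <= logistic_loss(u) for u in grid
)
```

The base-2 logarithm matters here: with the natural logarithm the logistic loss takes the value $\log 2 < 1$ at zero and would no longer upper bound the zero-one loss at that point.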
6.2.3 Relationship with logistic regression


Logistic regression is a widely used binary classification algorithm, a specific instance
of conditional maximum entropy models. For the purpose of this chapter, we
first give a very brief description of the algorithm. Logistic regression returns a
conditional probability distribution of the form

$$p_{\boldsymbol{\alpha}}[y \mid x] = \frac{1}{Z(x)} e^{y \boldsymbol{\alpha} \cdot \Phi(x)}, \tag{6.7}$$

where $y \in \{-1, +1\}$ is the label predicted for $x \in \mathcal{X}$, $\Phi(x) \in \mathbb{R}^N$ is a feature vector
associated to $x$, $\boldsymbol{\alpha} \in \mathbb{R}^N$ a parameter weight vector, and $Z(x) = e^{+\boldsymbol{\alpha} \cdot \Phi(x)} + e^{-\boldsymbol{\alpha} \cdot \Phi(x)}$
a normalization factor. Dividing both the numerator and denominator by $e^{+y \boldsymbol{\alpha} \cdot \Phi(x)}$
and taking the log leads to:

$$\log(p_{\boldsymbol{\alpha}}[y \mid x]) = -\log(1 + e^{-2 y \boldsymbol{\alpha} \cdot \Phi(x)}). \tag{6.8}$$

The parameter $\boldsymbol{\alpha}$ is learned via maximum likelihood by logistic regression, that is,
by maximizing the probability of the sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$. Since the
points are sampled i.i.d., this can be written as follows: $\operatorname{argmax}_{\boldsymbol{\alpha}} \prod_{i=1}^{m} p_{\boldsymbol{\alpha}}[y_i \mid x_i]$.
Taking the negative log of the probabilities shows that the objective function
minimized by logistic regression is

$$G(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \log(1 + e^{-2 y_i \boldsymbol{\alpha} \cdot \Phi(x_i)}). \tag{6.9}$$

Thus, modulo constants, which do not affect the solution sought, the objective
function coincides with the one based on the logistic loss. AdaBoost and logistic
regression have in fact many other relationships that we will not discuss in detail
here. In particular, it can be shown that both algorithms solve exactly the same
optimization problem, except for a normalization constraint required for logistic
regression not imposed in the case of AdaBoost.
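The passage from (6.7) to (6.8) and (6.9) can be checked numerically. In the sketch below, `score` stands for the scalar $\boldsymbol{\alpha} \cdot \Phi(x)$; the function names and the sample scores and labels are hypothetical values chosen for illustration.

```python
import math

def p_alpha(y, score):
    """Conditional probability (6.7), with score = alpha . Phi(x)."""
    return math.exp(y * score) / (math.exp(score) + math.exp(-score))

def log_p_closed_form(y, score):
    """Right-hand side of (6.8)."""
    return -math.log(1 + math.exp(-2 * y * score))

def objective(scores, labels):
    """Objective G of (6.9): negative log-likelihood of the sample."""
    return sum(math.log(1 + math.exp(-2 * y * s)) for s, y in zip(scores, labels))

scores = [0.3, -1.2, 2.0]      # hypothetical values of alpha . Phi(x_i)
labels = [+1, -1, -1]          # hypothetical labels y_i
neg_log_likelihood = -sum(math.log(p_alpha(y, s)) for s, y in zip(scores, labels))
```

The quantity `neg_log_likelihood`, computed directly from the probabilities, agrees with `objective(scores, labels)` up to floating-point error, which is the content of (6.9).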


6.2.4 Standard use in practice 


Here we briefly describe the practical use of AdaBoost. An important requirement 
for the algorithm is the choice of the base classifiers or that of the weak learner. The 
family of base classifiers typically used with AdaBoost in practice is that of decision 
trees, which are equivalent to hierarchical partitions of the space (see chapter 8, 
section 8.3.3). In fact, more precisely, decision trees of depth one, also known as 
stumps or boosting stumps, are by far the most frequently used base classifiers.

Boosting stumps are threshold functions associated to a single feature. Thus,
a stump corresponds to a single axis-aligned partition of the space, as illustrated
in figure 6.2. If the data is in $\mathbb{R}^N$, we can associate a stump to each of the $N$
components. Thus, to determine the stump with the minimal weighted error at
each round of boosting, the best component and the best threshold for each
component must be computed.

Figure 6.5 An empirical result using AdaBoost with C4.5 decision trees as base
learners. In this example, the training error goes to zero after about 5 rounds of
boosting ($T \approx 5$), yet the test error continues to decrease for larger values of $T$.

To do so, we can first presort each component in $O(m \log m)$ time with a total
computational cost of $O(mN \log m)$. For a given component, there are only $m + 1$
possible distinct thresholds, since two thresholds between the same consecutive
component values are equivalent. To find the best threshold at each round of
boosting, all of these possible $m + 1$ values can be compared, which can be done in
$O(m)$ time. Thus, the total computational complexity of the algorithm for $T$ rounds
of boosting is $O(mN \log m + mNT)$.
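The $O(m)$ comparison of thresholds relies on updating the weighted error incrementally as the threshold sweeps over the sorted values. The following Python sketch shows one way to do this, assuming a stump of the form "predict $s$ if $x_j > \theta$, else $-s$"; the function names and the toy data are hypothetical, and the presorted index lists are computed once and reused at every round.

```python
def presort(X):
    """Sort the sample indices along each component once: O(mN log m) overall."""
    return [sorted(range(len(X)), key=lambda i: X[i][j]) for j in range(len(X[0]))]

def best_stump_sorted(X, y, D, order):
    """Return (error, j, theta, s) of a stump with minimal weighted error in O(mN) time."""
    best = (float("inf"), None, None, None)
    total = sum(D)
    for j, idx in enumerate(order):
        m = len(idx)
        # Threshold below every value: the stump with s = +1 predicts +1 everywhere
        # and errs exactly on the negative points.
        err = sum(D[i] for i in idx if y[i] == -1)
        theta = X[idx[0]][j] - 1.0
        k = 0
        while True:
            for s, e in ((+1, err), (-1, total - err)):
                if e < best[0]:
                    best = (e, j, theta, s)
            if k == m:
                break
            # Move the threshold past the next block of equal values: those points
            # are now predicted -1 by the s = +1 stump.
            v = X[idx[k]][j]
            while k < m and X[idx[k]][j] == v:
                i = idx[k]
                err += D[i] if y[i] == +1 else -D[i]
                k += 1
            theta = (v + X[idx[k]][j]) / 2 if k < m else v + 1.0
    return best

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]   # hypothetical one-dimensional sample
y = [+1, +1, -1, -1, +1, +1]
order = presort(X)
uniform = best_stump_sorted(X, y, [1 / 6] * 6, order)
skewed = best_stump_sorted(X, y, [1 / 12, 1 / 12, 1 / 6, 1 / 6, 1 / 4, 1 / 4], order)
```

Each point is visited once per component, so a round costs $O(mN)$; only the distribution `D` changes between rounds, while `order` stays fixed.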

Observe, however, that while boosting stumps are widely used in combination
with AdaBoost and can perform well in practice, the algorithm that returns the
stump with the minimal empirical error is not a weak learner (see definition 6.1)!
Consider, for example, the simple XOR example with four data points lying in
$\mathbb{R}^2$ (see figure 5.2a), where points in the second and fourth quadrants are labeled
positively and those in the first and third quadrants negatively. Then, no decision
stump can achieve an accuracy better than 1/2.
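The XOR claim can be confirmed by enumeration, since on four points only a handful of stumps are distinct. The sketch below uses one hypothetical placement of the four points (one per quadrant, at coordinates $\pm 1$) and thresholds that cover every possible split along each axis; all names are introduced for this illustration.

```python
# One point per quadrant: first and third negative, second and fourth positive.
XOR_X = [(1, 1), (-1, -1), (-1, 1), (1, -1)]
XOR_y = [-1, -1, +1, +1]

def stump(j, theta, s):
    """Stump on component j: predict s if x[j] > theta, else -s."""
    return lambda x: s if x[j] > theta else -s

def accuracy(h, X, y):
    """Fraction of points of (X, y) classified correctly by h."""
    return sum(h(x) == yi for x, yi in zip(X, y)) / len(X)

# Thresholds below, between, and above the two values taken by each coordinate.
stumps = [stump(j, theta, s) for j in (0, 1) for theta in (-2.0, 0.0, 2.0) for s in (+1, -1)]
accuracies = [accuracy(h, XOR_X, XOR_y) for h in stumps]
```

Every entry of `accuracies` equals 1/2 under the uniform distribution: each side of any axis-aligned split contains one positive and one negative point.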


6.3 Theoretical results


In this section we present a theoretical analysis of the generalization properties of 
AdaBoost. 




6.3.1 VC-dimension-based analysis 


We start with an analysis of AdaBoost based on the VC-dimension of its hypothesis
set. For $T$ rounds of boosting, its hypothesis set is

$$\mathcal{F}_T = \Big\{ \mathrm{sgn}\Big(\sum_{t=1}^{T} \alpha_t h_t\Big) \colon \alpha_t \in \mathbb{R}, h_t \in H, t \in [1, T] \Big\}. \tag{6.10}$$

The VC-dimension of $\mathcal{F}_T$ can be bounded as follows in terms of the VC-dimension
$d$ of the family of base hypotheses $H$ (exercise 6.1):

$$\mathrm{VCdim}(\mathcal{F}_T) \le 2(d + 1)(T + 1) \log_2((T + 1)e). \tag{6.11}$$


The upper bound grows as $O(dT \log T)$, thus the bound suggests that AdaBoost
could overfit for large values of $T$, and indeed this can occur. However, in many
cases, it has been observed empirically that the generalization error of AdaBoost
decreases as a function of the number of rounds of boosting $T$, as illustrated in
figure 6.5! How can these empirical results be explained? The following sections
present an analysis based on a concept of margin, similar to the one presented for
SVMs.


6.3.2 Margin-based analysis 


In chapter 4 we gave a definition of margin for linear classifiers such as SVMs 
(definition 4.2). Here we will need a somewhat different but related definition of 
margin for linear combinations of base classifiers, as in the case of AdaBoost. 

First note that a linear combination of base classifiers $g = \sum_{t=1}^{T} \alpha_t h_t$ can be
defined equivalently via $g(x) = \boldsymbol{\alpha} \cdot \mathbf{h}(x)$ for all $x \in \mathcal{X}$, with $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_T)^\top$ and
$\mathbf{h}(x) = [h_1(x), \ldots, h_T(x)]^\top$. This makes their similarity with the linear hypotheses
considered in chapter 4 and chapter 5 evident: $\mathbf{h}(x)$ is the feature vector associated
to $x$, which was previously denoted by $\Phi(x)$, and $\boldsymbol{\alpha}$ is the weight vector that was
denoted by $\mathbf{w}$. The base classifier values $h_t(x)$ are the components of the feature
vector associated to $x$. For AdaBoost, additionally, the weight vector is non-negative:
$\boldsymbol{\alpha} \ge 0$.

We will use the same notation to introduce the following definition. 


Definition 6.2 $L_1$-margin

The $L_1$-margin $\rho(x)$ of a point $x \in \mathcal{X}$ with label $y \in \{-1, +1\}$ for a linear
combination of base classifiers $g = \sum_{t=1}^{T} \alpha_t h_t$ with $\boldsymbol{\alpha} \ne 0$ and $h_t \in H$ for all
$t \in [1, T]$ is defined as

$$\rho(x) = \frac{y g(x)}{\sum_{t=1}^{T} |\alpha_t|} = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\boldsymbol{\alpha}\|_1} = \frac{y\, \boldsymbol{\alpha} \cdot \mathbf{h}(x)}{\|\boldsymbol{\alpha}\|_1}. \tag{6.12}$$

The $L_1$-margin of a linear combination classifier $g$ with respect to a sample
$S = (x_1, \ldots, x_m)$ is the minimum margin of the points within the sample:

$$\rho = \min_{i \in [1, m]} \frac{y_i\, \boldsymbol{\alpha} \cdot \mathbf{h}(x_i)}{\|\boldsymbol{\alpha}\|_1}. \tag{6.13}$$

When the coefficients $\alpha_t$ are non-negative, as in the case of AdaBoost, $\rho(x)$ is a
convex combination of the base classifier values $h_t(x)$. In particular, if the base
classifiers $h_t$ take values in $[-1, +1]$, then $\rho(x)$ is in $[-1, +1]$. The absolute value
$|\rho(x)|$ can be interpreted as the confidence of the classifier $g$ in that label.

This definition of margin differs from definition 4.2 given for linear classifiers only
by the norm used for the weight vector: $L_1$ norm here, $L_2$ norm in definition 4.2.
Indeed, in the case of a linear hypothesis $x \mapsto w \cdot \Phi(x)$, the margin for point $x$ with
label $y$ was defined as follows:

$$\rho(x) = y\,\frac{w \cdot \Phi(x)}{\|w\|_2},$$

and was based on the $L_2$ norm of $w$. When the prediction is correct, that is
$y(\alpha \cdot h(x)) \geq 0$, the $L_1$-margin and the $L_2$-margin of definition 4.2 can be rewritten as

$$\rho_1(x) = \frac{|\alpha \cdot h(x)|}{\|\alpha\|_1} \quad \text{and} \quad \rho_2(x) = \frac{|\alpha \cdot h(x)|}{\|\alpha\|_2}.$$

It is known that for $p, q \geq 1$ with $p$ and $q$ conjugate, i.e., $1/p + 1/q = 1$, the quantity
$|\alpha \cdot x| / \|\alpha\|_p$ is the $L_q$ distance of $x$ to the hyperplane of equation $\alpha \cdot x = 0$. Thus,
both $\rho_1(x)$ and $\rho_2(x)$ measure the distance of the feature vector $h(x)$ to the hyperplane
$\alpha \cdot x = 0$ in $\mathbb{R}^T$: $\rho_1(x)$ is its $\|\cdot\|_\infty$ distance, and $\rho_2(x)$ its $\|\cdot\|_2$ or Euclidean
distance (see figure 6.6).
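
The distance interpretation can be checked on a small example. The Python sketch below computes both distances and, for the $\|\cdot\|_\infty$ case, constructs a point of the hyperplane at exactly that distance. The function names and the numeric values are illustrative assumptions, not part of the text.

```python
import math


def dot(alpha, x):
    """Inner product alpha . x."""
    return sum(a * v for a, v in zip(alpha, x))


def linf_distance(alpha, x):
    """|alpha . x| / ||alpha||_1, the L-infinity distance to {z : alpha . z = 0}."""
    return abs(dot(alpha, x)) / sum(abs(a) for a in alpha)


def l2_distance(alpha, x):
    """|alpha . x| / ||alpha||_2, the Euclidean distance to the same hyperplane."""
    return abs(dot(alpha, x)) / math.sqrt(sum(a * a for a in alpha))


def closest_point_linf(alpha, x):
    """A point of the hyperplane at L-infinity distance linf_distance(alpha, x).

    Every coordinate with alpha_t != 0 moves by the same amount, in the
    direction of sign(alpha_t).
    """
    step = dot(alpha, x) / sum(abs(a) for a in alpha)
    return [v - step * ((a > 0) - (a < 0)) for a, v in zip(alpha, x)]


alpha = [3.0, -1.0]
x = [2.0, 2.0]
z = closest_point_linf(alpha, x)  # lies on the hyperplane alpha . z = 0
```

Here the pair ($p = 1$, $q = \infty$) is used for the $L_1$-margin and ($p = q = 2$) for the Euclidean case.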

To examine the generalization properties of AdaBoost, we first analyze the 
Rademacher complexity of convex combinations of hypotheses such as those defined 
by AdaBoost. Next, we use the margin-based analysis from chapter 4 to derive a 
margin-based generalization bound for boosting with the definition of margin just 
introduced. 

For any hypothesis set $H$ of real-valued functions, we denote by $\mathrm{conv}(H)$ its
convex hull defined by

$$\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \geq 1,\ \forall k \in [1,p],\ \mu_k \geq 0,\ h_k \in H,\ \sum_{k=1}^{p} \mu_k \leq 1 \Big\}. \tag{6.14}$$

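The following theorem shows that, remarkably, the empirical Rademacher complexity
of $\mathrm{conv}(H)$, which in general is a strictly larger set including $H$, coincides with
that of $H$.

Theorem 6.2

Let $H$ be a set of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Then, for any sample $S$, we have
$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.


Proof The proof follows from a straightforward series of equalities:

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\mathrm{conv}(H))
&= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_1, \ldots, h_p \in H,\ \mu \geq 0,\ \|\mu\|_1 \leq 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Big] \\
&= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_1, \ldots, h_p \in H}\ \sup_{\mu \geq 0,\ \|\mu\|_1 \leq 1} \sum_{k=1}^{p} \mu_k \Big( \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big) \Big] \\
&= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h_1, \ldots, h_p \in H}\ \max_{k \in [1,p]} \Big( \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big) \Big] \\
&= \frac{1}{m} \mathop{\mathrm{E}}_{\sigma} \Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big] = \widehat{\mathfrak{R}}_S(H).
\end{aligned}$$

The main equality to recognize is the third one, which is based on the observation
that the maximizing vector $\mu$ for the convex combination of $p$ terms is the one
placing all the weight on the largest term of the sum. □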
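
The key step of the proof can be checked numerically for a small finite hypothesis set. The Python sketch below is only an illustration under stated assumptions: each hypothesis is represented by its vector of values on the sample, the expectation over $\sigma$ is computed by enumerating all sign vectors, and the random convex combinations use weights that sum to exactly one. All names are hypothetical.

```python
import itertools
import random


def correlation(sigma, values):
    """sum_i sigma_i * h(x_i) for a hypothesis given by its values on the sample."""
    return sum(s * v for s, v in zip(sigma, values))


def rademacher_finite(hypotheses):
    """Empirical Rademacher complexity of a finite set, enumerating all sigma."""
    m = len(hypotheses[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(correlation(sigma, h) for h in hypotheses)
    return total / (2 ** m) / m


def random_convex_combination(hypotheses, rng):
    """Values on the sample of a random convex combination (weights sum to 1)."""
    weights = [rng.random() for _ in hypotheses]
    total = sum(weights)
    weights = [w / total for w in weights]
    m = len(hypotheses[0])
    return [sum(w * h[i] for w, h in zip(weights, hypotheses)) for i in range(m)]


# Hypothetical base hypotheses, described by their values on a sample of size 4.
H = [[1, -1, 1, 1], [-1, -1, 1, -1], [1, 1, -1, 1]]
rng = random.Random(0)
enlarged = H + [random_convex_combination(H, rng) for _ in range(50)]
```

Adding convex combinations to `H` does not change the computed complexity, because for every $\sigma$ the correlation of a convex combination is at most that of the best base hypothesis.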


This theorem can be used directly in combination with theorem 4.4 to derive 
the following Rademacher complexity generalization bound for convex combination 
ensembles of hypotheses. 


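Corollary 6.1 Ensemble Rademacher margin bound
Let $H$ denote a set of real-valued functions. Fix $\rho > 0$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, each of the following holds for all $h \in \mathrm{conv}(H)$:

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \tag{6.15}$$

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\,\widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \tag{6.16}$$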


Using corollary 3.1 and corollary 3.3 to bound the Rademacher complexity in 
terms of the VC-dimension yields immediately the following VC-dimension-based 
generalization bounds for convex combination ensembles of hypotheses. 


Corollary 6.2 Ensemble VC-Dimension margin bound
Let $H$ be a family of functions taking values in $\{+1,-1\}$ with VC-dimension $d$. Fix
$\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for
all $h \in \mathrm{conv}(H)$:

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}. \tag{6.17}$$


These bounds can be generalized to hold uniformly for all $\rho > 0$, instead of a fixed
$\rho$, at the price of an additional term of the form $\sqrt{(\log \log_2 \frac{2}{\rho})/m}$ as in theorem 4.5.
They cannot be directly applied to the linear combination $g$ generated by AdaBoost,
since it is not a convex combination of base hypotheses, but they can be applied to
the following normalized version of $g$:

$$x \mapsto \frac{g(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H). \tag{6.18}$$


Note that from the point of view of binary classification, $g$ and $g/\|\alpha\|_1$ are equivalent
since $\mathrm{sgn}(g) = \mathrm{sgn}(g/\|\alpha\|_1)$, thus $R(g) = R(g/\|\alpha\|_1)$, but their empirical margin
losses are distinct. Let $g = \sum_{t=1}^{T} \alpha_t h_t$ denote the function defining the classifier
returned by AdaBoost after $T$ rounds of boosting when trained on sample $S$. Then,
in view of (6.15), for any $\delta > 0$, the following holds with probability at least $1 - \delta$:

$$R(g) \leq \widehat{R}_\rho(g/\|\alpha\|_1) + \frac{2}{\rho}\,\mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}. \tag{6.19}$$

Similar bounds can be derived from (6.16) and (6.17). Remarkably, the number
of rounds of boosting $T$ does not appear in the generalization bound (6.19). The
bound depends only on the margin $\rho$, the sample size $m$, and the Rademacher
complexity of the family of base classifiers $H$. Thus, the bound guarantees an
effective generalization if the margin loss $\widehat{R}_\rho(g/\|\alpha\|_1)$ is small for a relatively large
$\rho$. Recall that the margin loss can be upper bounded by the fraction of the points
$x$ in the training sample with $y\,g(x)/\|\alpha\|_1 \leq \rho$ (see (4.39)). Thus, with our definition
of $L_1$-margin, it can be bounded by the fraction of the points in $S$ with $L_1$-margin
at most $\rho$:

$$\widehat{R}_\rho(g/\|\alpha\|_1) \leq \frac{|\{i \in [1,m] : \rho(x_i) \leq \rho\}|}{m}. \tag{6.20}$$


Additionally, the following theorem provides a bound on the empirical margin loss, 
which decreases with T under conditions discussed later. 


Theorem 6.3
Let $g = \sum_{t=1}^{T} \alpha_t h_t$ denote the function defining the classifier returned by AdaBoost
after $T$ rounds of boosting and assume for all $t \in [1,T]$ that $\epsilon_t < \frac{1}{2}$, which implies
$\alpha_t > 0$. Then, for any $\rho > 0$, the following holds:

$$\widehat{R}_\rho\Big(\frac{g}{\|\alpha\|_1}\Big) \leq 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.$$

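Proof Using the general inequality $1_{u \leq 0} \leq \exp(-u)$ valid for all $u \in \mathbb{R}$, identity 6.2,
that is $D_{t+1}(i) = \frac{e^{-y_i g_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, the equality $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$ from the proof
of theorem 6.1, and the definition of $\alpha$ in AdaBoost, we can write:

$$\begin{aligned}
\frac{1}{m} \sum_{i=1}^{m} 1_{y_i g(x_i) - \rho \|\alpha\|_1 \leq 0}
&\leq \frac{1}{m} \sum_{i=1}^{m} \exp\big(-y_i g(x_i) + \rho \|\alpha\|_1\big) \\
&= \frac{1}{m} \sum_{i=1}^{m} e^{\rho \|\alpha\|_1} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i) \\
&= e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t = e^{\rho \sum_{t'} \alpha_{t'}} \prod_{t=1}^{T} Z_t \\
&= 2^T \prod_{t=1}^{T} \Big[ \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \Big]^{\rho} \sqrt{\epsilon_t (1 - \epsilon_t)},
\end{aligned}$$

which concludes the proof.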
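
The bound of theorem 6.3 can be observed on a toy problem. The following Python sketch is a minimal illustration, not the book's implementation: it assumes one-dimensional inputs, threshold classifiers (stumps) as base classifiers, and a hypothetical data set chosen so that no single stump is perfect. It runs AdaBoost, then compares the fraction of points with $L_1$-margin at most $\rho$ to the right-hand side of the theorem.

```python
import math


def make_stump(theta, sign):
    """Threshold classifier on the real line: sign if x > theta, else -sign."""
    return lambda x: sign if x > theta else -sign


def adaboost(xs, ys, stumps, rounds):
    """Plain AdaBoost; returns the selected stumps, coefficients, and errors."""
    m = len(xs)
    dist = [1.0 / m] * m
    chosen, alphas, errors = [], [], []
    for _ in range(rounds):
        def weighted_error(h):
            return sum(d for d, x, y in zip(dist, xs, ys) if h(x) != y)
        h = min(stumps, key=weighted_error)      # base classifier of round t
        eps = weighted_error(h)                  # its error epsilon_t
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # Reweight the points and renormalize (the normalizer is Z_t).
        dist = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(dist, xs, ys)]
        z = sum(dist)
        dist = [d / z for d in dist]
        chosen.append(h)
        alphas.append(alpha)
        errors.append(eps)
    return chosen, alphas, errors


def margin_fraction(xs, ys, chosen, alphas, rho):
    """Fraction of sample points whose L1-margin is at most rho."""
    norm1 = sum(abs(a) for a in alphas)
    count = 0
    for x, y in zip(xs, ys):
        margin = y * sum(a * h(x) for a, h in zip(alphas, chosen)) / norm1
        if margin <= rho:
            count += 1
    return count / len(xs)


def theorem_bound(errors, rho):
    """2^T * prod_t sqrt(eps_t^(1 - rho) * (1 - eps_t)^(1 + rho))."""
    bound = 1.0
    for eps in errors:
        bound *= 2.0 * math.sqrt(eps ** (1.0 - rho) * (1.0 - eps) ** (1.0 + rho))
    return bound


# Hypothetical sample: labels form a +, -, + pattern along the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [+1, +1, +1, -1, -1, -1, +1, +1, +1]
stumps = [make_stump(k + 0.5, s) for k in range(10) for s in (+1, -1)]
chosen, alphas, errors = adaboost(xs, ys, stumps, rounds=20)
```

For any $\rho > 0$, `margin_fraction(xs, ys, chosen, alphas, rho)` stays below `theorem_bound(errors, rho)`; the bound is only informative when it is smaller than one.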


Figure 6.6 Maximum margins with respect to both the $L_2$ and $L_\infty$ norm.

Moreover, if for all $t \in [1,T]$ we have $\gamma \leq (\frac{1}{2} - \epsilon_t)$ and $\rho \leq 2\gamma$, then the expression
$4 \epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}$ is maximized at $\epsilon_t = \frac{1}{2} - \gamma$.¹ Thus, the upper bound on the empirical
margin loss can then be bounded by

$$\widehat{R}_\rho\Big(\frac{g}{\|\alpha\|_1}\Big) \leq \Big[ (1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} \Big]^{T/2}. \tag{6.21}$$

1. The differential of $f\colon \epsilon \mapsto \log[\epsilon^{1-\rho}(1-\epsilon)^{1+\rho}] = (1-\rho)\log\epsilon + (1+\rho)\log(1-\epsilon)$ over the
interval $(0,1)$ is given by $f'(\epsilon) = \frac{1-\rho}{\epsilon} - \frac{1+\rho}{1-\epsilon} = 2\,\frac{(\frac{1}{2} - \frac{\rho}{2}) - \epsilon}{\epsilon(1-\epsilon)}$. Thus, $f$ is an increasing function
over $(0, \frac{1}{2} - \frac{\rho}{2})$, which implies that it is increasing over $(0, \frac{1}{2} - \gamma)$ when $\gamma \geq \frac{\rho}{2}$.

Observe that $(1 - 2\gamma)^{1-\rho}(1 + 2\gamma)^{1+\rho} = (1 - 4\gamma^2)\big(\frac{1+2\gamma}{1-2\gamma}\big)^{\rho}$. This is an increasing
function of $\rho$ since we have $\big(\frac{1+2\gamma}{1-2\gamma}\big) > 1$ as a consequence of $\gamma > 0$. Thus, if $\rho < \gamma$,
it can be strictly upper bounded as follows

$$(1 - 2\gamma)^{1-\rho}(1 + 2\gamma)^{1+\rho} < (1 - 2\gamma)^{1-\gamma}(1 + 2\gamma)^{1+\gamma}.$$

The function $\gamma \mapsto (1 - 2\gamma)^{1-\gamma}(1 + 2\gamma)^{1+\gamma}$ is strictly upper bounded by 1 over the
interval $(0, 1/2)$, thus, if $\rho < \gamma$, then $(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho} < 1$ and the right-hand
side of (6.21) decreases exponentially with $T$. Since the condition $\rho \gg O(1/\sqrt{m})$ is
necessary in order for the given margin bounds to converge, this places a condition
of $\gamma \gg O(1/\sqrt{m})$ on the edge value. In practice, the error $\epsilon_t$ of the base classifier at
round $t$ may increase as a function of $t$. Informally, this is because boosting presses
the weak learner to concentrate on instances that are harder and harder to classify,
for which even the best base classifier could not achieve an error significantly better
than random. If $\epsilon_t$ becomes close to $1/2$ relatively fast as a function of $t$, then the
bound of theorem 6.3 becomes uninformative.
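
The behavior of the right-hand side of (6.21) can be explored numerically. The short Python sketch below evaluates its base for a few edge values; the function names and the chosen values of $\gamma$ and $\rho$ are illustrative assumptions.

```python
def base(gamma, rho):
    """(1 - 2*gamma)^(1 - rho) * (1 + 2*gamma)^(1 + rho), the base in (6.21)."""
    return (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)


def margin_loss_bound(gamma, rho, rounds):
    """Right-hand side of (6.21) after the given number of boosting rounds."""
    return base(gamma, rho) ** (rounds / 2)


# Hypothetical edge values; rho is taken smaller than gamma in each case.
settings = [(0.1, 0.05), (0.2, 0.1), (0.3, 0.15)]
bases = [base(gamma, rho) for gamma, rho in settings]
```

With $\rho < \gamma$ the base is below one, so `margin_loss_bound` shrinks geometrically with the number of rounds. Taking $\rho$ too large relative to $\gamma$ (for instance $\gamma = 0.1$, $\rho = 0.2$) pushes the base above one and the bound no longer decreases.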

The margin bounds of corollary 6.1 and corollary 6.2, combined with the bound
on the empirical margin loss of theorem 6.3, suggest that under some conditions,
AdaBoost can achieve a large margin on the training sample. They could also serve
as a theoretical explanation of the empirical observation that in some tasks the
generalization error decreases as a function of $T$ even after the error on the training
sample is zero: the margin would continue to increase. But does AdaBoost maximize
the $L_1$-margin?

No. It has been shown that AdaBoost may converge to a margin that is significantly
smaller than the maximum margin (e.g., 1/3 instead of 3/8). However, under
some general assumptions, when the data is separable and the base learners satisfy
particular conditions, it has been proven that AdaBoost can asymptotically achieve
a margin that is at least half the maximum margin, $\rho_{\max}/2$.


6.3.3 Margin maximization


In view of these results, several algorithms have been devised with the explicit goal
of maximizing the $L_1$-margin. These algorithms correspond to different methods for
solving a linear program (LP).

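By definition of the $L_1$-margin, the maximum margin for a sample $S =
((x_1, y_1), \ldots, (x_m, y_m))$ is given by

$$\rho = \max_{\alpha} \min_{i \in [1,m]} y_i\,\frac{\alpha \cdot h(x_i)}{\|\alpha\|_1}. \tag{6.22}$$

By definition of the maximization, the optimization problem can be written as:

$$\begin{aligned}
& \max_{\alpha}\ \rho \\
& \text{subject to: } y_i\,\frac{\alpha \cdot h(x_i)}{\|\alpha\|_1} \geq \rho, \quad \forall i \in [1,m].
\end{aligned}$$

Since $\frac{\alpha \cdot h(x_i)}{\|\alpha\|_1}$ is invariant to the scaling of $\alpha$, we can restrict ourselves to $\|\alpha\|_1 = 1$.
Further seeking a non-negative $\alpha$ as in the case of AdaBoost leads to the following
optimization:

$$\begin{aligned}
& \max_{\alpha}\ \rho \\
& \text{subject to: } y_i (\alpha \cdot h(x_i)) \geq \rho, \quad \forall i \in [1,m] \\
& \qquad \Big( \sum_{t=1}^{T} \alpha_t = 1 \Big) \wedge \big( \alpha_t \geq 0,\ \forall t \in [1,T] \big).
\end{aligned}$$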


This is a linear program (LP), that is, an optimization problem with a linear
objective function and linear constraints. There are several different methods for
solving relatively large LPs in practice, using the simplex method, interior-point
methods, or a variety of special-purpose solutions.
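
To hand the margin-maximization problem to a generic LP solver, it is typically rewritten as a minimization with inequality and equality constraints. The following is one possible rewriting, given as a sketch. It assumes that the variables are stacked as $z = (\alpha_1, \ldots, \alpha_T, \rho)$ and uses the shorthand $M_{it} = y_i h_t(x_i)$, which is introduced here for convenience only.

```latex
\begin{align*}
\min_{z = (\alpha, \rho)} \quad & -\rho \\
\text{subject to:} \quad
  & \rho - \sum_{t=1}^{T} M_{it}\, \alpha_t \le 0, \quad \forall i \in [1, m], \\
  & \sum_{t=1}^{T} \alpha_t = 1, \\
  & \alpha_t \ge 0, \quad \forall t \in [1, T].
\end{align*}
```

In this form there are $T + 1$ variables, $m$ inequality constraints, one equality constraint, and $\rho$ is left free in sign.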

Note that the solution of this algorithm differs from the margin-maximization
defining SVMs in the separable case only by the definition of the margin used ($L_1$
versus $L_2$) and the non-negativity constraint on the weight vector. Figure 6.6 illustrates
the margin-maximizing hyperplanes found using these two distinct margin
definitions in a simple case. The left figure shows the SVM solution, where the distance
to the closest points to the hyperplane is measured with respect to the norm
$\|\cdot\|_2$. The right figure shows the solution for the $L_1$-margin, where the distance to
the closest points to the hyperplane is measured with respect to the norm $\|\cdot\|_\infty$.

By definition, the solution of the LP just described admits an $L_1$-margin that
is larger than or equal to that of the AdaBoost solution. However, empirical results do
not show a systematic benefit for the solution of the LP. In fact, it appears that in
many cases, AdaBoost outperforms that algorithm. The margin theory described
does not seem sufficient to explain that performance.


6.3.4 Game-theoretic interpretation 


In this section, we first show that AdaBoost admits a natural game-theoretic
interpretation. The application of von Neumann’s theorem then helps us relate the
maximum margin and the optimal edge and clarify the connection of AdaBoost’s
weak-learning assumption with the notion of $L_1$-margin. We first introduce the
definition of the edge for a specific classifier and a particular distribution.

          | rock | paper | scissors
rock      |   0  |  +1   |   -1
paper     |  -1  |   0   |   +1
scissors  |  +1  |  -1   |    0


Table 6.1 The loss matrix for the standard rock-paper-scissors game. 


Definition 6.3
The edge of a base classifier $h_t$ for a distribution $D$ over the training sample is
defined by

$$\gamma_t(D) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i) D(i). \tag{6.23}$$

AdaBoost’s weak learning condition can now be formulated as: there exists $\gamma > 0$
such that for any distribution $D$ over the training sample and any base classifier $h_t$,
the following holds:

$$\gamma_t(D) \geq \gamma. \tag{6.24}$$

This condition is required for the analysis of theorem 6.1 and the non-negativity of
the coefficients $\alpha_t$. We will frame boosting as a two-person zero-sum game.
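
The two expressions of the edge in definition 6.3 can be compared directly. In the Python sketch below, a base classifier is represented by its list of predictions on the sample; the function names and the toy values are illustrative assumptions.

```python
def weighted_error(dist, labels, predictions):
    """Error of a base classifier under the distribution dist over the sample."""
    return sum(d for d, y, p in zip(dist, labels, predictions) if y != p)


def edge(dist, labels, predictions):
    """Edge (1/2) * sum_i y_i h(x_i) D(i) of a base classifier."""
    return 0.5 * sum(d * y * p for d, y, p in zip(dist, labels, predictions))


# Hypothetical sample of four points under the uniform distribution.
dist = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]   # one mistake, on the second point
```

With one mistake out of four equally weighted points, the error is 1/4 and the edge is 1/2 - 1/4 = 1/4, so the classifier does better than random guessing by that amount.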


Definition 6.4 Zero-sum game

A two-person zero-sum game consists of a loss matrix $M \in \mathbb{R}^{m \times n}$, where $m$ is the
number of possible actions (or pure strategies) for the row player and $n$ the number
of possible actions for the column player. The entry $M_{ij}$ is the loss for the row
player (or equivalently the payoff for the column player) when the row player takes
action $i$ and the column player takes action $j$.²

2. To be consistent with the results discussed in other chapters, we consider the loss matrix
as opposed to the payoff matrix (its opposite).

An example of a loss matrix for the familiar “rock-paper-scissors” game is shown
in table 6.1.


Definition 6.5 Mixed strategy

A mixed strategy for the row player is a distribution $p$ over the $m$ possible row
actions; a mixed strategy for the column player is a distribution $q$ over the $n$
possible column actions. The expected loss for the row player (expected payoff for
the column player) with respect to the mixed strategies $p$ and $q$ is

$$\mathrm{E}[\text{loss}] = p^\top M q = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j.$$
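
As a concrete instance of this formula, the following Python sketch evaluates the expected loss for the rock-paper-scissors matrix of table 6.1. The names `LOSS` and `expected_loss` are illustrative, and exact rational arithmetic is used only to keep the values easy to read.

```python
from fractions import Fraction

# Loss matrix of table 6.1 for the row player; rows and columns are ordered
# rock, paper, scissors.
LOSS = [
    [0, 1, -1],
    [-1, 0, 1],
    [1, -1, 0],
]


def expected_loss(p, loss, q):
    """p^T M q for mixed strategies p (row player) and q (column player)."""
    return sum(p[i] * loss[i][j] * q[j]
               for i in range(len(p)) for j in range(len(q)))


uniform = [Fraction(1, 3)] * 3
always_rock = [1, 0, 0]
always_paper = [0, 1, 0]
```

A pure strategy is the special case of a distribution placing all its mass on one action: `expected_loss(always_rock, LOSS, always_paper)` is the single entry of the matrix where rock meets paper, while any strategy played against `uniform` has expected loss zero.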
The following is a fundamental result in game theory proven in chapter 7. 


Theorem 6.4 Von Neumann’s minimax theorem
For any two-person zero-sum game defined by matrix $M$,

$$\min_{p} \max_{q} p^\top M q = \max_{q} \min_{p} p^\top M q. \tag{6.25}$$

The common value in (6.25) is called the value of the game. The theorem states
that for any two-person zero-sum game, there exists a mixed strategy for each player
such that the expected loss for one is the same as the expected payoff for the other,
both of which are equal to the value of the game. Note that, given the row player’s
strategy, the column player can choose an optimal pure strategy, that is, the column
player can choose the single strategy corresponding to the largest coordinate of the
vector $p^\top M$. A similar comment applies to the reverse. Thus, an alternative and
equivalent form of the minimax theorem is

$$\min_{p} \max_{j \in [1,n]} p^\top M e_j = \max_{q} \min_{i \in [1,m]} e_i^\top M q, \tag{6.26}$$

where $e_i$ denotes the $i$th unit vector. We can now view AdaBoost as a zero-sum
game, where an action of the row player is the selection of a training instance
$x_i$, $i \in [1,m]$, and an action of the column player the selection of a base learner
$h_t$, $t \in [1,T]$. A mixed strategy for the row player is thus a distribution $D$ over
the training points’ indices $[1,m]$. A mixed strategy for the column player is a
distribution over the base classifiers’ indices $[1,T]$. This can be defined from a
non-negative vector $\alpha \geq 0$: the weight assigned to $t \in [1,T]$ is $\alpha_t/\|\alpha\|_1$. The
loss matrix $M \in \{-1,+1\}^{m \times T}$ for AdaBoost is defined by $M_{it} = y_i h_t(x_i)$ for
all $(i,t) \in [1,m] \times [1,T]$. By von Neumann’s theorem (6.26), the following holds:

$$\min_{D \in \mathcal{D}} \max_{t \in [1,T]} \sum_{i=1}^{m} D(i)\, y_i h_t(x_i) = \max_{\alpha \geq 0} \min_{i \in [1,m]} \sum_{t=1}^{T} \frac{\alpha_t}{\|\alpha\|_1}\, y_i h_t(x_i), \tag{6.27}$$

where $\mathcal{D}$ denotes the set of all distributions over the training sample. Let $\rho_\alpha(x)$
denote the margin of point $x$ for the classifier defined by $g = \sum_{t=1}^{T} \alpha_t h_t$. The result
can be rewritten as follows in terms of the margins and edges:

$$2\gamma^* = 2 \min_{D} \max_{t \in [1,T]} \gamma_t(D) = \max_{\alpha} \min_{i \in [1,m]} \rho_\alpha(x_i) = \rho^*, \tag{6.28}$$

where $\rho^*$ is the maximum margin of a classifier and $\gamma^*$ the best possible edge. This
result has several implications. First, it shows that the weak learning condition
($\gamma^* > 0$) implies $\rho^* > 0$ and thus the existence of a classifier with positive margin,
which motivates the search for a non-zero margin. AdaBoost can be viewed as an
algorithm seeking to achieve such a non-zero margin, though, as discussed earlier,
AdaBoost does not always achieve an optimal margin and is thus suboptimal in that
respect. Furthermore, we see that the “weak learning” assumption, which originally
appeared to be the weakest condition one could require for an algorithm (that of
performing better than random), is in fact a strong condition: it implies that the
training sample is linearly separable with margin $2\gamma^* > 0$. Linear separability often
does not hold for the data sets found in practice.
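
The relation between the best edge and the maximum margin can be verified by hand on a very small game. For any distribution $D$ and any non-negative $\alpha$, the minimum margin of $\alpha$ is at most $\rho^*$, which equals $2\gamma^*$, which in turn is at most twice the best edge under $D$; a pair for which the two ends coincide therefore certifies the common value. The Python sketch below checks such a pair. The toy sample, the six threshold classifiers, and all names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical toy problem: three points with labels (+1, -1, +1) and six
# threshold classifiers, each listed by its predictions on the three points.
labels = [1, -1, 1]
classifiers = [
    [1, 1, 1], [-1, 1, 1], [-1, -1, 1],
    [-1, -1, -1], [1, -1, -1], [1, 1, -1],
]
# Loss matrix of the boosting game: M[i][t] = y_i * h_t(x_i).
M = [[labels[i] * h[i] for h in classifiers] for i in range(len(labels))]


def best_edge(dist):
    """max_t gamma_t(D) = max_t (1/2) * sum_i D(i) * M[i][t]."""
    return max(Fraction(1, 2) * sum(d * row[t] for d, row in zip(dist, M))
               for t in range(len(classifiers)))


def min_margin(alpha):
    """min_i rho_alpha(x_i) for a non-negative, non-zero weight vector alpha."""
    norm1 = sum(alpha)
    return min(Fraction(sum(a * v for a, v in zip(alpha, row))) / norm1
               for row in M)


dist = [Fraction(1, 3)] * 3    # candidate mixed strategy of the row player
alpha = [1, 0, 1, 0, 1, 0]     # candidate (unnormalized) strategy of the column player
```

Here twice the best edge under `dist` and the minimum margin of `alpha` are both 1/3, so for this toy game $\rho^* = 2\gamma^* = 1/3$.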


6.4 Discussion 


AdaBoost offers several advantages: it is simple, its implementation is straightforward,
and the time complexity of each round of boosting as a function of the sample
size is rather favorable. As already discussed, when using decision stumps, the time
complexity of each round of boosting is in $O(mN)$. Of course, if the dimension of
the feature space $N$ is very large, then the algorithm could in fact become quite
slow.

AdaBoost additionally benefits from a rich theoretical analysis. Nevertheless, 
there are still many theoretical questions. For example, as we saw, the algorithm in 
fact does not maximize the margin, and yet algorithms that do maximize the margin 
do not always outperform it. This suggests that perhaps a finer analysis based on a 
notion different from that of margin could shed more light on the properties of the 
algorithm. 

The main drawbacks of the algorithm are the need to select the parameter $T$ and
the base classifiers, and its poor performance in the presence of noise. The choice of
the number of rounds of boosting $T$ (stopping criterion) is crucial to the performance
of the algorithm. As suggested by the VC-dimension analysis, larger values of $T$ can
lead to overfitting. In practice, $T$ is typically determined via cross-validation.

The choice of the base classifiers is also crucial. The complexity of the family 
of base classifiers H appeared in all the bounds presented and it is important to 
control it in order to guarantee generalization. On the other hand, insufficiently 
complex hypothesis sets could lead to low margins. 

Probably the most serious disadvantage of AdaBoost is its performance in the
presence of noise; it has been shown empirically that noise severely damages its
accuracy. The distribution weight assigned to examples that are harder to classify
substantially increases with the number of rounds of boosting, by the nature of the
algorithm. These examples end up dominating the selection of the base classifiers,
which, with a large enough number of rounds, will play a detrimental role in the
linear combination defined by AdaBoost.

Several solutions have been proposed to address these issues. One consists of using
a “less aggressive” objective function than the exponential function of AdaBoost,
such as the logistic loss, to penalize less incorrectly classified points. Another
solution is based on a regularization, e.g., an $L_1$-regularization, which consists of
adding a term to the objective function to penalize larger weights. This could be
viewed as a soft margin approach for boosting. However, recent theoretical results
show that boosting algorithms based on convex potentials do not tolerate even low
levels of random noise, even with $L_1$-regularization or early stopping.

The behavior of AdaBoost in the presence of noise can be used, however, as a 
useful feature for detecting outliers, that is, examples that are incorrectly labeled 
or that are hard to classify. Examples with large weights after a certain number of 
rounds of boosting can be identified as outliers. 


6.5 Chapter notes 


The question of whether a weak learning algorithm could be boosted to derive a 
strong learning algorithm was first posed by Kearns and Valiant [1988, 1994], who 
also gave a negative proof of this result for a distribution-dependent setting. The 
first positive proof of this result in a distribution-independent setting was given by 
Schapire [1990], and later by Freund [1990]. 

These early boosting algorithms, boosting by filtering [Schapire, 1990] or boosting 
by majority [Freund, 1990, 1995] were not practical. The AdaBoost algorithm 
introduced by Freund and Schapire [1997] solved several of these practical issues. 
Freund and Schapire [1997] further gave a detailed presentation and analysis of the 
algorithm including the bound on its empirical error, a VC-dimension analysis, and 
its applications to multi-class classification and regression. 

Early experiments with AdaBoost were carried out by Drucker, Schapire, and 
Simard [1993], who gave the first implementation in OCR with weak learners based 
on neural networks and Drucker and Cortes [1995], who reported the empirical 
performance of AdaBoost combined with decision trees, in particular decision 
stumps. 

The fact that AdaBoost coincides with coordinate descent applied to an exponential
objective function was later shown by Duffy and Helmbold [1999], Mason
et al. [1999], and Friedman [2000]. Friedman, Hastie, and Tibshirani [2000] also
gave an interpretation of boosting in terms of additive models. They also pointed
out the close connections between AdaBoost and logistic regression, in particular
the fact that their objective functions have a similar behavior near zero or the
fact that their expectations admit the same minimizer, and derived an alternative
boosting algorithm, LogitBoost, based on the logistic loss. Lafferty [1999] showed
how an incremental family of algorithms, including LogitBoost, can be derived from
Bregman divergences and designed to closely approximate AdaBoost when varying
a parameter. Kivinen and Warmuth [1999] observed that boosting can be viewed
as a type of entropy projection. Collins, Schapire, and Singer [2002] later showed
that boosting and logistic regression were special instances of a common framework
based on Bregman divergences and used that to give the first convergence proof
of AdaBoost. Probably the most direct relationship between AdaBoost and logistic
regression is the proof by Lebanon and Lafferty [2001] that the two algorithms
minimize the same extended relative entropy objective function subject to the same
feature constraints, except for an additional normalization constraint for logistic
regression.

A margin-based analysis of AdaBoost was first presented by Schapire, Freund,
Bartlett, and Lee [1997], including theorem 6.3 which gives a bound on the empirical
margin loss. Our presentation is based on the elegant derivation of margin bounds
by Koltchinskii and Panchenko [2002] using the notion of Rademacher complexity.
Rudin et al. [2004] gave an example showing that, in general, AdaBoost does not
maximize the $L_1$-margin. Rätsch and Warmuth [2002] provided asymptotic lower
bounds for the margin achieved by AdaBoost under some conditions. The $L_1$-margin
maximization based on a LP is due to Grove and Schuurmans [1998]. The game-theoretic
interpretation of boosting and the application of von Neumann’s minimax
theorem [von Neumann, 1928] in that context were pointed out by Freund and
Schapire [1996, 1999b]; see also Grove and Schuurmans [1998], Breiman [1999].

Dietterich [2000] provided extensive empirical evidence for the fact that noise can
severely damage the accuracy of AdaBoost. This has been reported by a number of
other authors since then. Rätsch, Onoda, and Müller [2001] suggested the use of a
soft margin for AdaBoost based on a regularization of the objective function and
pointed out its connections with SVMs. Long and Servedio [2010] recently showed
the failure of boosting algorithms based on convex potentials to tolerate random
noise, even with $L_1$-regularization or early stopping.

There are several excellent surveys and tutorials related to boosting [Schapire,
2003, Meir and Rätsch, 2002, Meir and Rätsch, 2003].


6.6 Exercises 


6.1 VC-dimension of the hypothesis set of AdaBoost. 


Prove the upper bound on the VC-dimension of the hypothesis set $F_T$ of AdaBoost
after $T$ rounds of boosting, as stated in equation 6.11.


6.2 Alternative objective functions. 


This problem studies boosting-type algorithms defined with objective functions
different from that of AdaBoost. We assume that the training data are given as
$m$ labeled examples $(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \{-1,+1\}$. We further assume
that $\Phi$ is a strictly increasing convex and differentiable function over $\mathbb{R}$ such that:
$\forall x \geq 0, \Phi(x) \geq 1$ and $\forall x < 0, \Phi(x) > 0$.


(a) Consider the loss function $L(\alpha) = \sum_{i=1}^{m} \Phi(-y_i g(x_i))$ where $g$ is a linear
combination of base classifiers, i.e., $g = \sum_{t=1}^{T} \alpha_t h_t$ (as in AdaBoost). Derive a
new boosting algorithm using the objective function $L$. In particular, characterize
the best base classifier $h_u$ to select at each round of boosting if we use
coordinate descent.


(b) Consider the following functions: (1) zero-one loss $\Phi_1(-u) = 1_{u \leq 0}$; (2) least
squared loss $\Phi_2(-u) = (1-u)^2$; (3) SVM loss $\Phi_3(-u) = \max\{0, 1-u\}$; and (4)
logistic loss $\Phi_4(-u) = \log(1 + e^{-u})$. Which functions satisfy the assumptions
on $\Phi$ stated earlier in this problem?


(c) For each loss function satisfying these assumptions, derive the corresponding
boosting algorithm. How do the algorithm(s) differ from AdaBoost?


6.3 Update guarantee. Assume that the main weak learner assumption of AdaBoost
holds. Let $h_t$ be the base learner selected at round $t$. Show that the base learner
$h_{t+1}$ selected at round $t+1$ must be different from $h_t$.


6.4 Weighted instances. Let the training sample be $S = ((x_1, y_1), \ldots, (x_m, y_m))$.
Suppose we wish to penalize differently errors made on $x_i$ versus $x_j$. To do that, we
associate some non-negative importance weight $w_i$ to each point $x_i$ and define the
objective function $F(\alpha) = \sum_{i=1}^{m} w_i e^{-y_i g(x_i)}$, where $g = \sum_{t=1}^{T} \alpha_t h_t$. Show that this
function is convex and differentiable and use it to derive a boosting-type algorithm.


6.5 Define the unnormalized correlation of two vectors $x$ and $x'$ as the inner product
between these vectors. Prove that the distribution vector $(D_{t+1}(1), \ldots, D_{t+1}(m))$
defined by AdaBoost and the vector of components $y_i h_t(x_i)$ are uncorrelated.


6.6 Fix $\epsilon \in (0, 1/2)$. Let the training sample be defined by $m$ points in the plane
with $\frac{m}{4}$ negative points all at coordinate $(1,1)$, another $\frac{m}{4}$ negative points all at
coordinate $(-1,-1)$, $\frac{m(1-\epsilon)}{4}$ positive points all at coordinate $(1,-1)$, and $\frac{m(1+\epsilon)}{4}$
positive points all at coordinate $(-1,+1)$. Describe the behavior of AdaBoost when
run on this sample using boosting stumps. What solution does the algorithm return
after $T$ rounds?


6.7 Noise-tolerant AdaBoost. AdaBoost may significantly overfit in the presence
of noise, in part due to the high penalization of misclassified examples. To reduce
this effect, one could use instead the following objective function:

$$F = \sum_{i=1}^{m} G(-y_i g(x_i)), \tag{6.29}$$

where $G$ is the function defined on $\mathbb{R}$ by

$$G(x) = \begin{cases} e^{x} & \text{if } x \leq 0 \\ x + 1 & \text{otherwise.} \end{cases} \tag{6.30}$$


(a) Show that the function G is convex and differentiable. 


(b) Use F and greedy coordinate descent to derive an algorithm similar to 
AdaBoost. 


(c) Compare the reduction of the empirical error rate of this algorithm with 
that of AdaBoost. 


6.8 Simplified AdaBoost. Suppose we simplify AdaBoost by setting the parameter
$\alpha_t$ to a fixed value $\alpha_t = \alpha > 0$, independent of the boosting round $t$.


(a) Let $\gamma$ be such that $(\frac{1}{2} - \epsilon_t) \geq \gamma > 0$. Find the best value of $\alpha$ as a function
of $\gamma$ by analyzing the empirical error.

(b) For this value of $\alpha$, does the algorithm assign the same probability mass
to correctly classified and misclassified examples at each round? If not, which
set is assigned a higher probability mass?

(c) Using the previous value of $\alpha$, give a bound on the empirical error of the
algorithm that depends only on $\gamma$ and the number of rounds of boosting $T$.

(d) Using the previous bound, show that for $T > \frac{\log m}{2\gamma^2}$ the resulting hypothesis
is consistent with the sample of size $m$.

(e) Let $s$ be the VC-dimension of the base learners used. Give a bound on the
generalization error of the consistent hypothesis obtained after $T = \lfloor \frac{\log m}{2\gamma^2} \rfloor + 1$
rounds of boosting. (Hint: Use the fact that the VC-dimension of the family
of functions $\{\mathrm{sgn}(\sum_{t=1}^{T} \alpha_t h_t) : \alpha_t \in \mathbb{R}\}$ is bounded by $2(s+1)T \log_2(eT)$).
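Suppose now that $\gamma$ varies with $m$. Based on the bound derived, what can be
said if $\gamma(m) = O\big(\sqrt{\frac{\log m}{m}}\big)$?


MATRIX-BASED ADABOOST($M$, $t_{\max}$)

$\lambda_{1,j} \leftarrow 0$ for $j = 1, \ldots, n$
for $t \leftarrow 1$ to $t_{\max}$ do
    $d_{t,i} \leftarrow \frac{\exp(-(M\lambda_t)_i)}{\sum_{k=1}^{m} \exp(-(M\lambda_t)_k)}$ for $i = 1, \ldots, m$
    $j_t \leftarrow \mathrm{argmax}_j\, (d_t^\top M)_j$
    $r_t \leftarrow (d_t^\top M)_{j_t}$
    $\alpha_t \leftarrow \frac{1}{2} \log\big(\frac{1 + r_t}{1 - r_t}\big)$
    $\lambda_{t+1} \leftarrow \lambda_t + \alpha_t e_{j_t}$, where $e_{j_t}$ is 1 in position $j_t$ and 0 elsewhere.
return $\frac{\lambda_{t_{\max}}}{\|\lambda_{t_{\max}}\|_1}$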


Figure 6.7 Matrix-based AdaBoost. 
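
The pseudocode of figure 6.7 translates almost line by line into code. The Python sketch below is one possible rendering under stated assumptions: the matrix is a list of rows with entries in $\{-1,+1\}$, no column is entirely $+1$ (otherwise $r_t = 1$ and $\alpha_t$ is infinite), and the vector returned is the one obtained after the last update. The small matrix `M_toy` is a hypothetical example, not the matrix of exercise 6.9.

```python
import math


def matrix_adaboost(M, t_max):
    """Matrix-based AdaBoost on an m x n matrix M with entries in {-1, +1}.

    Returns the L1-normalized coefficient vector and the selected columns.
    """
    m, n = len(M), len(M[0])
    lam = [0.0] * n
    picked = []
    for _ in range(t_max):
        # d_{t,i} is proportional to exp(-(M lambda_t)_i).
        scores = [sum(M[i][j] * lam[j] for j in range(n)) for i in range(m)]
        weights = [math.exp(-s) for s in scores]
        total = sum(weights)
        d = [w / total for w in weights]
        # (d_t^T M)_j for every column, and the column with the largest value.
        edges = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        j_t = max(range(n), key=lambda j: edges[j])
        r_t = edges[j_t]
        alpha_t = 0.5 * math.log((1.0 + r_t) / (1.0 - r_t))
        lam[j_t] += alpha_t
        picked.append(j_t)
    norm1 = sum(abs(v) for v in lam)
    return [v / norm1 for v in lam], picked


# Hypothetical 3 x 6 matrix: three training points, six weak classifiers.
M_toy = [
    [1, -1, -1, -1, 1, 1],
    [-1, -1, 1, 1, 1, -1],
    [1, 1, 1, -1, -1, -1],
]
lam, picked = matrix_adaboost(M_toy, 30)
margins = [sum(row[j] * lam[j] for j in range(len(lam))) for row in M_toy]
```

The list `picked` records which weak classifier is selected at each round, and `min(margins)` is the $L_1$-margin of the returned combination on the sample.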


6.9 Matrix-based AdaBoost.


(a) Define an $m \times n$ matrix $M$ where $M_{ij} = y_i h_j(x_i)$, i.e., $M_{ij} = +1$ if training
example $i$ is classified correctly by weak classifier $h_j$, and $-1$ otherwise. Let
$d_t, \lambda_t \in \mathbb{R}^n$, $\|d_t\|_1 = 1$ and $d_{t,i}$ (respectively $\lambda_{t,i}$) equal the $i$th component of $d_t$
(respectively $\lambda_t$). Now, consider the matrix-based form of AdaBoost described
in figure 6.7 and define $M$ as below with eight training points and eight weak
classifiers.

$$M = \begin{pmatrix}
-1 & 1 & 1 & 1 & 1 & -1 & -1 & 1 \\
-1 & 1 & 1 & -1 & -1 & 1 & 1 & 1 \\
1 & -1 & 1 & 1 & 1 & -1 & 1 & 1 \\
1 & -1 & 1 & 1 & -1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & 1 & 1 & 1 & -1 \\
1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 \\
1 & 1 & -1 & 1 & 1 & 1 & -1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & 1 & -1
\end{pmatrix}$$

Assume that we start with the following initial distribution over the datapoints:

$$d_1 = \Big( \frac{3 - \sqrt{5}}{8}, \frac{3 - \sqrt{5}}{8}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{\sqrt{5} - 1}{8}, \frac{\sqrt{5} - 1}{8}, 0 \Big)^\top.$$

Compute the first few steps of the matrix-based AdaBoost algorithm using $M$,
$d_1$, and $t_{\max} = 7$. What weak classifier is picked at each round of boosting?
Do you notice any pattern?

(b) What is the $L_1$ norm margin produced by AdaBoost for this example?


(c) Instead of using AdaBoost, imagine we combined our classifiers using the
following coefficients: $[2, 3, 4, 1, 2, 2, 1, 1] \times \frac{1}{16}$. What is the margin in this case?
Does AdaBoost maximize the margin?


7 On-Line Learning 


This chapter presents an introduction to on-line learning, an important area with a 
rich literature and multiple connections with game theory and optimization that 
is increasingly influencing the theoretical and algorithmic advances in machine 
learning. In addition to the intriguing novel learning theory questions that they
raise, on-line learning algorithms are particularly attractive in modern applications
since they offer a practical solution for large-scale problems.

These algorithms process one sample at a time and can thus be significantly 
more efficient both in time and space and more practical than batch algorithms, 
when processing modern data sets of several million or billion points. They are 
also typically easy to implement. Moreover, on-line algorithms do not require any 
distributional assumption; their analysis assumes an adversarial scenario. This 
makes them applicable in a variety of scenarios where the sample points are not 
drawn i.i.d. or according to a fixed distribution. 

We first introduce the general scenario of on-line learning, then present and 
analyze several key algorithms for on-line learning with expert advice, including 
the deterministic and randomized weighted majority algorithms for the zero-one 
loss and an extension of these algorithms for convex losses. We also describe and
analyze two standard on-line algorithms for linear classification, the Perceptron and
Winnow algorithms, as well as some extensions. While on-line learning algorithms
are designed for an adversarial scenario, they can be used, under some assumptions, 
to derive accurate predictors for a distributional scenario. We derive learning 
guarantees for this on-line to batch conversion. Finally, we briefly point out the 
connection of on-line learning with game theory by describing their use to derive a 
simple proof of von Neumann’s minimax theorem. 


7.1 Introduction 


The learning framework for on-line algorithms is in stark contrast to the PAC 
learning or stochastic models discussed up to this point. First, instead of learning 
from a training set and then testing on a test set, the on-line learning scenario mixes
the training and test phases. Second, PAC learning follows the key assumption
that the distribution over data points is fixed over time, both for training and test 
points, and that points are sampled in an i.i.d. fashion. Under this assumption, the 
natural goal is to learn a hypothesis with a small expected loss or generalization 
error. In contrast, with on-line learning, no distributional assumption is made, 
and thus there is no notion of generalization. Instead, the performance of on-line 
learning algorithms is measured using a mistake model and the notion of regret. To 
derive guarantees in this model, theoretical analyses are based on a worst-case or 
adversarial assumption. 

The general on-line setting involves $T$ rounds. At the $t$th round, the algorithm
receives an instance $x_t \in \mathcal{X}$ and makes a prediction $\hat{y}_t \in \mathcal{Y}$. It then receives the true
label $y_t \in \mathcal{Y}$ and incurs a loss $L(\hat{y}_t, y_t)$, where $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function.
More generally, the prediction domain for the algorithm may be $\mathcal{Y}' \neq \mathcal{Y}$ and the loss
function defined over $\mathcal{Y}' \times \mathcal{Y}$. For classification problems, we often have $\mathcal{Y} = \{0, 1\}$
and $L(y, y') = |y' - y|$, while for regression $\mathcal{Y} \subseteq \mathbb{R}$ and typically $L(y, y') = (y' - y)^2$.
The objective in the on-line setting is to minimize the cumulative loss $\sum_{t=1}^{T} L(\hat{y}_t, y_t)$
over $T$ rounds.


7.2. Prediction with expert advice 


We first discuss the setting of on-line learning with expert advice, and the associated
notion of regret. In this setting, at the $t$th round, in addition to receiving $x_t \in \mathcal{X}$,
the algorithm also receives advice $y_{t,i} \in \mathcal{Y}$, $i \in [1, N]$, from $N$ experts. Following
the general framework of on-line algorithms, it then makes a prediction, receives
the true label, and incurs a loss. After $T$ rounds, the algorithm has incurred a
cumulative loss. The objective in this setting is to minimize the regret $R_T$, also
called external regret, which compares the cumulative loss of the algorithm to that
of the best expert in hindsight after $T$ rounds:

$$R_T = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} \sum_{t=1}^{T} L(y_{t,i}, y_t). \tag{7.1}$$


This problem arises in a variety of different domains and applications. Figure 7.1 
illustrates the problem of predicting the weather using several forecasting sources 
as experts. 
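
As a concrete illustration of definition (7.1), the following minimal sketch computes the external regret of a sequence of predictions under the zero-one loss. The function names and the toy data (four rounds, two experts) are hypothetical and chosen only for illustration.

```python
def zero_one_loss(prediction, label):
    """Zero-one loss |y' - y| for labels in {0, 1}."""
    return float(abs(prediction - label))


def external_regret(algorithm_predictions, expert_predictions, labels, loss=zero_one_loss):
    """R_T = sum_t L(yhat_t, y_t) - min_i sum_t L(y_{t,i}, y_t), as in (7.1)."""
    algorithm_loss = sum(loss(p, y) for p, y in zip(algorithm_predictions, labels))
    num_experts = len(expert_predictions[0])
    expert_losses = [
        sum(loss(advice[i], y) for advice, y in zip(expert_predictions, labels))
        for i in range(num_experts)
    ]
    return algorithm_loss - min(expert_losses)


# Hypothetical toy data: expert_predictions[t][i] is the advice of expert i at round t.
labels = [1, 0, 1, 1]
expert_predictions = [(1, 0), (0, 0), (0, 1), (1, 1)]
algorithm_predictions = [1, 1, 1, 0]
regret = external_regret(algorithm_predictions, expert_predictions, labels)
```

Here the algorithm makes two mistakes while each expert makes one, so the regret equals 1.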


7.2.1 Mistake bounds and Halving algorithm 


Figure 7.1 Weather forecast: an example of a prediction problem based on expert
advice. (Pictured sources: wunderground.com, bbc.com, weather.com, cnn.com, and the algorithm.)


Here, we assume that the loss function is the standard zero-one loss used in
classification. To analyze the expert advice setting, we first consider the realizable
case. As such, we discuss the mistake bound model, which asks the simple question
“How many mistakes before we learn a particular concept?” Since we are in the
realizable case, after some number of rounds $T$, we will learn the concept and no
longer make errors in subsequent rounds. For any fixed concept $c$, we define the
maximum number of mistakes a learning algorithm $A$ makes as

$$M_A(c) = \max_{x_1, \ldots, x_T} |\text{mistakes}(A, c)|. \tag{7.2}$$

Further, for any concept in a concept class $C$, the maximum number of mistakes a
learning algorithm makes is

$$M_A(C) = \max_{c \in C} M_A(c). \tag{7.3}$$

Our goal in this setting is to derive mistake bounds, that is, a bound $M$ on $M_A(C)$.
We will first do this for the Halving algorithm, an elegant and simple algorithm for 
which we can generate surprisingly favorable mistake bounds. At each round, the 
Halving algorithm makes its prediction by taking the majority vote over all active 
experts. After any incorrect prediction, it deactivates all experts that gave faulty 
advice. Initially, all experts are active, and by the time the algorithm has converged 
to the correct concept, the active set contains only those experts that are consistent 
with the target concept. The pseudocode for this algorithm is shown in figure 7.2. 
We also present straightforward mistake bounds in theorems 7.1 and 7.2, where 
the former deals with finite hypothesis sets and the latter relates mistake bounds to 
VC-dimension. Note that the hypothesis complexity term in theorem 7.1 is identical 
to the corresponding complexity term in the PAC model bound of theorem 2.1. 


Theorem 7.1
Let $H$ be a finite hypothesis set. Then

$$M_{\mathrm{Halving}}(H) \le \log_2 |H|. \tag{7.4}$$


Proof Since the algorithm makes predictions using majority vote from the active
set, at each mistake, the active set is reduced by at least half. Hence, after $\log_2 |H|$
mistakes, there can only remain one active hypothesis, and since we are in the
realizable case, this hypothesis must coincide with the target concept. ∎


HALVING($H$)
1  $H_1 \leftarrow H$
2  for $t \leftarrow 1$ to $T$ do
3      RECEIVE($x_t$)
4      $\hat{y}_t \leftarrow$ MAJORITYVOTE($H_t, x_t$)
5      RECEIVE($y_t$)
6      if ($\hat{y}_t \neq y_t$) then
7          $H_{t+1} \leftarrow \{c \in H_t : c(x_t) = y_t\}$
8  return $H_{T+1}$


Figure 7.2 Halving algorithm.


Theorem 7.2
Let $\mathrm{opt}(H)$ be the optimal mistake bound for $H$. Then,

$$\mathrm{VCdim}(H) \le \mathrm{opt}(H) \le M_{\mathrm{Halving}}(H) \le \log_2 |H|. \tag{7.5}$$


Proof The second inequality is true by definition and the third inequality holds
based on theorem 7.1. To prove the first inequality, we let $d = \mathrm{VCdim}(H)$. Then
there exists a shattered set of $d$ points, for which we can form a complete binary tree
of the mistakes with height $d$, and we can choose labels at each round of learning to
ensure that $d$ mistakes are made. Note that this adversarial argument is valid since
the on-line setting makes no statistical assumptions about the data. ∎
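
The Halving algorithm and the bound of theorem 7.1 can be illustrated with a minimal sketch. The hypothesis set used here (sixteen threshold functions over the integers) and the helper names are assumptions made for the example. Ties in the majority vote are broken in favor of label 1, which still removes at least half of the active set at each mistake.

```python
import math


def halving(hypotheses, stream):
    """Run the Halving algorithm; return (number of mistakes, active hypotheses)."""
    active = list(hypotheses)                      # H_1 <- H
    mistakes = 0
    for x, y in stream:
        votes_for_one = sum(h(x) for h in active)
        # Majority vote over the active set (ties broken toward label 1).
        prediction = 1 if 2 * votes_for_one >= len(active) else 0
        if prediction != y:
            mistakes += 1
            # Deactivate every hypothesis that gave faulty advice on x.
            active = [h for h in active if h(x) == y]
    return mistakes, active


def make_threshold(k):
    """Hypothetical concept: label 1 exactly when x >= k."""
    return lambda x: 1 if x >= k else 0


hypotheses = [make_threshold(k) for k in range(16)]   # |H| = 16
target = make_threshold(11)                           # realizable case: target is in H
stream = [(x, target(x)) for x in [8, 3, 12, 10, 11, 0, 15, 9, 11, 10]]
mistakes, active = halving(hypotheses, stream)
```

On this sequence the number of mistakes cannot exceed $\log_2 16 = 4$, regardless of the order in which the points are presented.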


7.2.2 Weighted majority algorithm 


In the previous section, we focused on the realizable setting in which the Halving 
algorithm simply discarded experts after a single mistake. We now move to the 
non-realizable setting and use a more general and less extreme algorithm, the 
Weighted Majority (WM) algorithm, that weights the importance of experts as 
a function of their mistake rate. The WM algorithm begins with uniform weights 
over all N experts. At each round, it generates predictions using a weighted majority 
vote. After receiving the true label, the algorithm then reduces the weight of each 
incorrect expert by a factor of $\beta \in [0, 1)$. Note that this algorithm reduces to the
Halving algorithm when $\beta = 0$. The pseudocode for the WM algorithm is shown in
figure 7.3.


WEIGHTED-MAJORITY($N$)
 1  for $i \leftarrow 1$ to $N$ do
 2      $w_{1,i} \leftarrow 1$
 3  for $t \leftarrow 1$ to $T$ do
 4      RECEIVE($x_t$)
 5      if $\sum_{i\colon y_{t,i}=1} w_{t,i} \ge \sum_{i\colon y_{t,i}=0} w_{t,i}$ then
 6          $\hat{y}_t \leftarrow 1$
 7      else $\hat{y}_t \leftarrow 0$
 8      RECEIVE($y_t$)
 9      if ($\hat{y}_t \neq y_t$) then
10          for $i \leftarrow 1$ to $N$ do
11              if ($y_{t,i} \neq y_t$) then
12                  $w_{t+1,i} \leftarrow \beta w_{t,i}$
13              else $w_{t+1,i} \leftarrow w_{t,i}$
14  return $\mathbf{w}_{T+1}$


Figure 7.3 Weighted majority algorithm, $y_t, y_{t,i} \in \{0, 1\}$.

Since we are not in the realizable setting, the mistake bounds of theorem 7.1
do not apply. However, the following theorem presents a bound on the number of
mistakes $m_T$ made by the WM algorithm after $T \ge 1$ rounds of on-line learning as
a function of the number of mistakes made by the best expert, that is, the expert
who achieves the smallest number of mistakes for the sequence $y_1, \ldots, y_T$. Let us
emphasize that this is the best expert in hindsight.


Theorem 7.3

Fix $\beta \in (0, 1)$. Let $m_T$ be the number of mistakes made by algorithm WM after $T \ge 1$
rounds, and $m_T^*$ be the number of mistakes made by the best of the $N$ experts. Then,
the following inequality holds:

$$m_T \le \frac{\log N + m_T^* \log \frac{1}{\beta}}{\log \frac{2}{1+\beta}}. \tag{7.6}$$


Proof To prove this theorem, we first introduce a potential function. We then
derive upper and lower bounds for this function, and combine them to obtain our
result. This potential function method is a general proof technique that we will use
throughout this chapter.

For any $t \ge 1$, we define our potential function as $W_t = \sum_{i=1}^{N} w_{t,i}$. Since
predictions are generated using weighted majority vote, if the algorithm makes
an error at round $t$, this implies that

$$W_{t+1} \le \big[1/2 + (1/2)\beta\big]\, W_t = \Big[\frac{1+\beta}{2}\Big] W_t. \tag{7.7}$$

Since $W_1 = N$ and $m_T$ mistakes are made after $T$ rounds, we thus have the following
upper bound:

$$W_T \le \Big[\frac{1+\beta}{2}\Big]^{m_T} N. \tag{7.8}$$

Next, since the weights are all non-negative, it is clear that for any expert $i$,
$W_T \ge w_{T,i} = \beta^{m_{T,i}}$, where $m_{T,i}$ is the number of mistakes made by the $i$th expert
after $T$ rounds. Applying this lower bound to the best expert and combining it with
the upper bound in (7.8) gives us:

$$\beta^{m_T^*} \le W_T \le \Big[\frac{1+\beta}{2}\Big]^{m_T} N$$
$$\Rightarrow\; m_T^* \log \beta \le \log N + m_T \log \Big[\frac{1+\beta}{2}\Big]$$
$$\Rightarrow\; m_T \log \Big[\frac{2}{1+\beta}\Big] \le \log N + m_T^* \log \frac{1}{\beta},$$

which concludes the proof. ∎


Thus, the theorem guarantees a bound of the following form for algorithm WM:

$$m_T \le O(\log N) + \text{constant} \times |\text{mistakes of best expert}|.$$

Since the first term varies only logarithmically as a function of $N$, the theorem
guarantees that the number of mistakes is roughly a constant times that of the best
expert in hindsight. This is a remarkable result, especially because it requires no
assumption about the sequence of points and labels generated. In particular, the
sequence could be chosen adversarially. In the realizable case where $m_T^* = 0$, the
bound reduces to $m_T \le O(\log N)$ as for the Halving algorithm.
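
The following sketch runs the WM algorithm of figure 7.3 on a synthetic sequence and compares the number of mistakes with the bound (7.6). The data generation (five experts with the hypothetical error rates listed below, and a fixed random seed) is an assumption made for the example; the bound itself holds for any sequence.

```python
import math
import random


def weighted_majority(expert_advice, labels, beta):
    """Return (mistakes of WM, list of per-expert mistakes) for labels in {0, 1}."""
    n = len(expert_advice[0])
    weights = [1.0] * n
    algorithm_mistakes = 0
    expert_mistakes = [0] * n
    for advice, y in zip(expert_advice, labels):
        weight_one = sum(w for w, a in zip(weights, advice) if a == 1)
        weight_zero = sum(w for w, a in zip(weights, advice) if a == 0)
        prediction = 1 if weight_one >= weight_zero else 0   # weighted majority vote
        for i, a in enumerate(advice):
            if a != y:
                expert_mistakes[i] += 1
        if prediction != y:
            algorithm_mistakes += 1
            # Weights are reduced only at rounds where the algorithm errs.
            weights = [w * beta if a != y else w for w, a in zip(weights, advice)]
    return algorithm_mistakes, expert_mistakes


rng = random.Random(0)
T, N, beta = 300, 5, 0.5
labels = [rng.randint(0, 1) for _ in range(T)]
error_rates = [0.1, 0.3, 0.4, 0.5, 0.5]          # hypothetical expert qualities
expert_advice = [
    [y if rng.random() > rate else 1 - y for rate in error_rates]
    for y in labels
]
m_T, expert_mistakes = weighted_majority(expert_advice, labels, beta)
m_star = min(expert_mistakes)
bound = (math.log(N) + m_star * math.log(1 / beta)) / math.log(2 / (1 + beta))
```

For $\beta = 1/2$ the multiplicative constant $\log\frac{1}{\beta} / \log\frac{2}{1+\beta}$ is about 2.41, so `m_T` is at most roughly 2.41 times `m_star` plus a term in $\log N$.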


7.2.3 Randomized weighted majority algorithm 
RANDOMIZED-WEIGHTED-MAJORITY($N$)
 1  for $i \leftarrow 1$ to $N$ do
 2      $w_{1,i} \leftarrow 1$
 3      $p_{1,i} \leftarrow 1/N$
 4  for $t \leftarrow 1$ to $T$ do
 5      for $i \leftarrow 1$ to $N$ do
 6          if ($l_{t,i} = 1$) then
 7              $w_{t+1,i} \leftarrow \beta w_{t,i}$
 8          else $w_{t+1,i} \leftarrow w_{t,i}$
 9      $W_{t+1} \leftarrow \sum_{i=1}^{N} w_{t+1,i}$
10      for $i \leftarrow 1$ to $N$ do
11          $p_{t+1,i} \leftarrow w_{t+1,i} / W_{t+1}$
12  return $\mathbf{w}_{T+1}$


Figure 7.4 Randomized weighted majority algorithm.


In spite of the guarantees just discussed, the WM algorithm admits a drawback that
affects all deterministic algorithms in the case of the zero-one loss: no deterministic
algorithm can achieve a regret $R_T = o(T)$ over all sequences. Clearly, for any
deterministic algorithm $A$ and any $t \in [1, T]$, we can adversarially select $y_t$ to
be 1 if the algorithm predicts 0, and choose it to be 0 otherwise. Thus, $A$ errs at
every point of such a sequence and its cumulative mistake is $m_T = T$. Assume for
example that $N = 2$ and that one expert always predicts 0, the other one always 1.
The error of the best expert over that sequence (and in fact any sequence of that
length) is then at most $m_T^* \le T/2$. Thus, for that sequence, we have

$$R_T = m_T - m_T^* \ge T/2,$$

which shows that $R_T = o(T)$ cannot be achieved in general. Note that this does
not contradict the bound proven in the previous section, since for any $\beta \in (0, 1)$,
$\frac{\log \frac{1}{\beta}}{\log \frac{2}{1+\beta}} \ge 2$. As we shall see in the next section, this negative result does not hold
for any loss that is convex with respect to one of its arguments. But for the zero-one
loss, this leads us to consider randomized algorithms instead.
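
The adversarial argument for deterministic algorithms can be sketched as follows, using the WM prediction rule as the deterministic algorithm and the two constant experts of the example. The function name and the choice $T = 100$, $\beta = 1/2$ are assumptions made for illustration.

```python
def adversarial_run(T, beta):
    """Play WM against an adversary; return (algorithm mistakes, best-expert mistakes).

    Expert 0 always predicts 0 and expert 1 always predicts 1.
    """
    weights = [1.0, 1.0]
    algorithm_mistakes = 0
    expert_mistakes = [0, 0]
    for _ in range(T):
        prediction = 1 if weights[1] >= weights[0] else 0
        y = 1 - prediction                 # the adversary picks the opposite label
        algorithm_mistakes += 1            # so the algorithm errs at every round
        wrong_expert = 1 - y               # the constant expert that disagrees with y
        expert_mistakes[wrong_expert] += 1
        weights[wrong_expert] *= beta
    return algorithm_mistakes, min(expert_mistakes)


T = 100
m_T, m_star = adversarial_run(T, beta=0.5)
regret = m_T - m_star
```

Since exactly one of the two experts errs at each round, their mistakes sum to $T$ and the better one makes at most $T/2$ of them, which gives a regret of at least $T/2$.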

In the randomized scenario of on-line learning, we assume that a set $A =
\{1, \ldots, N\}$ of $N$ actions is available. At each round $t \in [1, T]$, an on-line algorithm
$\mathcal{A}$ selects a distribution $\mathbf{p}_t$ over the set of actions, receives a loss vector $\mathbf{l}_t$, whose $i$th
component $l_{t,i} \in [0, 1]$ is the loss associated with action $i$, and incurs the expected
loss $L_t = \sum_{i=1}^{N} p_{t,i}\, l_{t,i}$. The total loss incurred by the algorithm over $T$ rounds
is $\mathcal{L}_T = \sum_{t=1}^{T} L_t$. The total loss associated to action $i$ is $\mathcal{L}_{T,i} = \sum_{t=1}^{T} l_{t,i}$. The
minimal loss of a single action is denoted by $\mathcal{L}_T^{\min} = \min_{i \in A} \mathcal{L}_{T,i}$. The regret $R_T$ of
the algorithm after $T$ rounds is then typically defined by the difference of the loss
of the algorithm and that of the best single action:¹

$$R_T = \mathcal{L}_T - \mathcal{L}_T^{\min}.$$

Here, we consider specifically the case of zero-one losses and assume that $l_{t,i} \in \{0, 1\}$
for all $t \in [1, T]$ and $i \in A$.

The WM algorithm admits a straightforward randomized version, the randomized 
weighted majority (RWM) algorithm. The pseudocode of this algorithm is given in 
figure 7.4. The algorithm updates the weight $w_{t,i}$ of expert $i$ as in the case of the WM
algorithm by multiplying it by $\beta$. The following theorem gives a strong guarantee
on the regret $R_T$ of the RWM algorithm, showing that it is in $O(\sqrt{T \log N})$.


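Theorem 7.4
Fix $\beta \in [1/2, 1)$. Then, for any $T \ge 1$, the loss of algorithm RWM on any sequence
can be bounded as follows:

$$\mathcal{L}_T \le \frac{\log N}{1-\beta} + (2-\beta)\, \mathcal{L}_T^{\min}. \tag{7.9}$$

In particular, for $\beta = \max\{1/2,\, 1 - \sqrt{(\log N)/T}\}$, the loss can be bounded as:

$$\mathcal{L}_T \le \mathcal{L}_T^{\min} + 2\sqrt{T \log N}. \tag{7.10}$$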


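Proof As in the proof of theorem 7.3, we derive upper and lower bounds for the
potential function $W_t = \sum_{i=1}^{N} w_{t,i}$, $t \in [1, T]$, and combine these bounds to obtain
the result. By definition of the algorithm, for any $t \in [1, T]$, $W_{t+1}$ can be expressed
as follows in terms of $W_t$:

$$W_{t+1} = \sum_{i\colon l_{t,i}=0} w_{t,i} + \beta \sum_{i\colon l_{t,i}=1} w_{t,i} = W_t + (\beta - 1) \sum_{i\colon l_{t,i}=1} w_{t,i}$$
$$= W_t + (\beta - 1)\, W_t \sum_{i\colon l_{t,i}=1} p_{t,i}$$
$$= W_t + (\beta - 1)\, W_t L_t$$
$$= W_t \big(1 - (1-\beta) L_t\big).$$

Thus, since $W_1 = N$, it follows that $W_{T+1} = N \prod_{t=1}^{T} \big(1 - (1-\beta) L_t\big)$. On the other
hand, the following lower bound clearly holds: $W_{T+1} \ge \max_{i \in [1,N]} w_{T+1,i} = \beta^{\mathcal{L}_T^{\min}}$.
This leads to the following inequality and series of derivations after taking the log
and using the inequalities $\log(1 - x) \le -x$ valid for all $x < 1$, and $-\log(1 - x) \le
x + x^2$ valid for all $x \in [0, 1/2]$:

$$\beta^{\mathcal{L}_T^{\min}} \le N \prod_{t=1}^{T} \big(1 - (1-\beta) L_t\big) \;\Rightarrow\; \mathcal{L}_T^{\min} \log \beta \le \log N + \sum_{t=1}^{T} \log\big(1 - (1-\beta) L_t\big)$$
$$\Rightarrow\; \mathcal{L}_T^{\min} \log \beta \le \log N - (1-\beta) \sum_{t=1}^{T} L_t$$
$$\Rightarrow\; \mathcal{L}_T^{\min} \log \beta \le \log N - (1-\beta)\, \mathcal{L}_T$$
$$\Rightarrow\; \mathcal{L}_T \le \frac{\log N}{1-\beta} - \frac{\log \beta}{1-\beta}\, \mathcal{L}_T^{\min}$$
$$\Rightarrow\; \mathcal{L}_T \le \frac{\log N}{1-\beta} - \frac{\log\big(1 - (1-\beta)\big)}{1-\beta}\, \mathcal{L}_T^{\min}$$
$$\Rightarrow\; \mathcal{L}_T \le \frac{\log N}{1-\beta} + (2-\beta)\, \mathcal{L}_T^{\min}.$$

This shows the first statement. Since $\mathcal{L}_T^{\min} \le T$, this also implies

$$\mathcal{L}_T \le \frac{\log N}{1-\beta} + (1-\beta)\, T + \mathcal{L}_T^{\min}. \tag{7.11}$$

Differentiating the upper bound with respect to $\beta$ and setting it to zero gives
$\frac{\log N}{(1-\beta)^2} - T = 0$, that is $\beta = \beta_0 = 1 - \sqrt{(\log N)/T}$, since $\beta < 1$. Thus, if
$1 - \sqrt{(\log N)/T} \ge 1/2$, $\beta_0$ is the minimizing value of $\beta$, otherwise $1/2$ is the optimal
value. The second statement follows by replacing $\beta$ with $\beta_0$ in (7.11).


1. Alternative definitions of the regret with comparison classes different from the set of
single actions can be considered.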
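
The guarantees (7.9) and (7.10) can be checked numerically with a minimal sketch of the RWM algorithm of figure 7.4. The loss sequence below (four actions with hypothetical loss rates and a fixed random seed) is an assumption made for the example; the function returns the expected loss $\mathcal{L}_T$, so no sampling of actions is needed.

```python
import math
import random


def randomized_weighted_majority(loss_vectors, beta):
    """Return (expected loss L_T of RWM, loss L_T^min of the best single action)."""
    n = len(loss_vectors[0])
    weights = [1.0] * n
    expected_loss = 0.0
    action_losses = [0] * n
    for losses in loss_vectors:
        total = sum(weights)
        probabilities = [w / total for w in weights]                        # p_t
        expected_loss += sum(p * l for p, l in zip(probabilities, losses))  # L_t
        for i, l in enumerate(losses):
            action_losses[i] += l
            if l == 1:
                weights[i] *= beta          # multiplicative update for zero-one losses
    return expected_loss, min(action_losses)


rng = random.Random(1)
T, N = 400, 4
loss_rates = [0.2, 0.4, 0.5, 0.6]           # hypothetical per-action loss rates
loss_vectors = [[1 if rng.random() < rate else 0 for rate in loss_rates] for _ in range(T)]

beta = max(0.5, 1 - math.sqrt(math.log(N) / T))
L_T, L_min = randomized_weighted_majority(loss_vectors, beta)
bound_7_9 = math.log(N) / (1 - beta) + (2 - beta) * L_min
bound_7_10 = L_min + 2 * math.sqrt(T * math.log(N))
```

With $T = 400$ and $N = 4$, the value $1 - \sqrt{(\log N)/T}$ is about 0.94, so the choice of $\beta$ coincides with $\beta_0$ and both bounds apply.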


The bound (7.10) assumes that the algorithm additionally receives as a parameter 
the number of rounds T. As we shall see in the next section, however, there exists 
a general doubling trick that can be used to relax this requirement at the price of 
a small constant factor increase. Inequality (7.10) can be written directly in terms of
the regret $R_T$ of the RWM algorithm:

$$R_T \le 2\sqrt{T \log N}. \tag{7.12}$$

Thus, for $N$ constant, the regret verifies $R_T = O(\sqrt{T})$ and the average regret or
regret per round $R_T/T$ decreases as $O(1/\sqrt{T})$. These results are optimal, as shown
by the following theorem.


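Theorem 7.5
Let $N = 2$. There exists a stochastic sequence of losses for which the regret of any
on-line learning algorithm verifies $\mathrm{E}[R_T] \ge \sqrt{T/8}$.


Proof For any $t \in [1, T]$, let the vector of losses $\mathbf{l}_t$ take the values $\mathbf{l}_{01} = (0, 1)^\top$
and $\mathbf{l}_{10} = (1, 0)^\top$ with equal probability. Then, the expected loss of any randomized
algorithm $\mathcal{A}$ is

$$\mathrm{E}[\mathcal{L}_T] = \mathrm{E}\Big[\sum_{t=1}^{T} \mathbf{p}_t \cdot \mathbf{l}_t\Big] = \sum_{t=1}^{T} \mathbf{p}_t \cdot \mathrm{E}[\mathbf{l}_t] = \sum_{t=1}^{T} \Big(\frac{1}{2}\, p_{t,1} + \frac{1}{2}\, (1 - p_{t,1})\Big) = T/2,$$

where we denoted by $\mathbf{p}_t$ the distribution selected by $\mathcal{A}$ at round $t$. By definition,
$\mathcal{L}_T^{\min}$ can be written as follows:

$$\mathcal{L}_T^{\min} = \min\{\mathcal{L}_{T,1}, \mathcal{L}_{T,2}\} = \frac{1}{2}\big(\mathcal{L}_{T,1} + \mathcal{L}_{T,2} - |\mathcal{L}_{T,1} - \mathcal{L}_{T,2}|\big) = T/2 - |\mathcal{L}_{T,1} - T/2|,$$

using the fact that $\mathcal{L}_{T,1} + \mathcal{L}_{T,2} = T$. Thus, the expected regret of $\mathcal{A}$ is

$$\mathrm{E}[R_T] = \mathrm{E}[\mathcal{L}_T] - \mathrm{E}[\mathcal{L}_T^{\min}] = \mathrm{E}\big[|\mathcal{L}_{T,1} - T/2|\big].$$

Let $\sigma_t$, $t \in [1, T]$, denote Rademacher variables taking values in $\{-1, +1\}$, then
$\mathcal{L}_{T,1}$ can be rewritten as $\mathcal{L}_{T,1} = \sum_{t=1}^{T} \frac{1 + \sigma_t}{2} = T/2 + \frac{1}{2} \sum_{t=1}^{T} \sigma_t$. Thus, introducing
scalars $x_t = 1/2$, $t \in [1, T]$, by the Khintchine-Kahane inequality (D.22), we have:

$$\mathrm{E}[R_T] = \mathrm{E}\Big[\Big|\sum_{t=1}^{T} \sigma_t x_t\Big|\Big] \ge \sqrt{\frac{1}{2} \sum_{t=1}^{T} x_t^2} = \sqrt{T/8},$$

which concludes the proof. ∎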


More generally, for $T \ge N$, a lower bound of $R_T = \Omega(\sqrt{T \log N})$ can be proven for
the regret of any algorithm.
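
For the two-action construction used in the proof of theorem 7.5, the expected regret $\mathrm{E}[|\mathcal{L}_{T,1} - T/2|]$ can be evaluated exactly, since $\mathcal{L}_{T,1}$ follows a binomial distribution with parameters $T$ and $1/2$. The sketch below (the function name is hypothetical) compares this exact value with the lower bound $\sqrt{T/8}$.

```python
import math


def expected_regret_two_actions(T):
    """Exact value of E|L_{T,1} - T/2| when L_{T,1} ~ Binomial(T, 1/2)."""
    return sum(math.comb(T, k) * abs(k - T / 2) for k in range(T + 1)) / 2 ** T


# Pairs (exact expected regret, lower bound sqrt(T/8)) for a few horizons.
comparisons = {T: (expected_regret_two_actions(T), math.sqrt(T / 8)) for T in (1, 10, 100)}
```

For $T = 1$ the exact value is $1/2$, while the lower bound is $\sqrt{1/8} \approx 0.354$.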


7.2.4 Exponential weighted average algorithm 


The WM algorithm can be extended to other loss functions $L$ taking values in
$[0, 1]$. The Exponential Weighted Average algorithm presented here can be viewed
as that extension for the case where $L$ is convex in its first argument. Note that this
algorithm is deterministic and yet, as we shall see, admits a very favorable regret
guarantee. Figure 7.5 gives its pseudocode. At round $t \in [1, T]$, the algorithm’s
prediction is

$$\hat{y}_t = \frac{\sum_{i=1}^{N} w_{t,i}\, y_{t,i}}{\sum_{i=1}^{N} w_{t,i}}, \tag{7.13}$$

where $y_{t,i}$ is the prediction by expert $i$ and $w_{t,i}$ the weight assigned by the algorithm
to that expert. Initially, all weights are set to one. The algorithm then updates the
weights at the end of round $t$ according to the following rule:

$$w_{t+1,i} \leftarrow w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)} = e^{-\eta L_{t,i}}, \tag{7.14}$$

where $L_{t,i}$ is the total loss incurred by expert $i$ after $t$ rounds.


EXPONENTIAL-WEIGHTED-AVERAGE($N$)
1  for $i \leftarrow 1$ to $N$ do
2      $w_{1,i} \leftarrow 1$
3  for $t \leftarrow 1$ to $T$ do
4      RECEIVE($x_t$)
5      $\hat{y}_t \leftarrow \frac{\sum_{i=1}^{N} w_{t,i}\, y_{t,i}}{\sum_{i=1}^{N} w_{t,i}}$
6      RECEIVE($y_t$)
7      for $i \leftarrow 1$ to $N$ do
8          $w_{t+1,i} \leftarrow w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)}$
9  return $\mathbf{w}_{T+1}$


Figure 7.5 Exponential weighted average, $L(y_{t,i}, y_t) \in [0, 1]$.


Note that this algorithm, like the others presented in this chapter, is simple: it
does not require keeping track of the losses incurred by each expert at all previous
rounds but only of their cumulative performance. This property is also
computationally advantageous. The following theorem presents a regret bound for
this algorithm.


Theorem 7.6

Assume that the loss function $L$ is convex in its first argument and takes values
in $[0, 1]$. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in \mathcal{Y}$, the regret of the
Exponential Weighted Average algorithm after $T$ rounds satisfies

$$R_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}. \tag{7.15}$$

In particular, for $\eta = \sqrt{8 \log N / T}$, the regret is bounded as

$$R_T \le \sqrt{(T/2) \log N}. \tag{7.16}$$


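Proof We apply the same potential function analysis as in previous proofs but
using as potential $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$, $t \in [1, T]$. Let $\mathbf{p}_t$ denote the distribution over
$\{1, \ldots, N\}$ with $p_{t,i} = \frac{w_{t,i}}{\sum_{i=1}^{N} w_{t,i}}$. To derive an upper bound on $\Phi_t$, we first examine
the difference of two consecutive potential values:

$$\Phi_{t+1} - \Phi_t = \log \frac{\sum_{i=1}^{N} w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)}}{\sum_{i=1}^{N} w_{t,i}} = \log\Big(\mathop{\mathrm{E}}_{\mathbf{p}_t}\big[e^{\eta X}\big]\Big),$$

with $X = -L(y_{t,i}, y_t) \in [-1, 0]$. To upper bound the expression appearing in the
right-hand side, we apply Hoeffding’s lemma (lemma D.1) to the centered random
variable $X - \mathrm{E}_{\mathbf{p}_t}[X]$, then Jensen’s inequality (theorem B.4) using the convexity of
$L$ with respect to its first argument:

$$\Phi_{t+1} - \Phi_t = \log\Big(\mathop{\mathrm{E}}_{\mathbf{p}_t}\big[e^{\eta (X - \mathrm{E}[X]) + \eta\, \mathrm{E}[X]}\big]\Big)$$
$$\le \frac{\eta^2}{8} + \eta \mathop{\mathrm{E}}_{\mathbf{p}_t}[X] = \frac{\eta^2}{8} - \eta \mathop{\mathrm{E}}_{\mathbf{p}_t}\big[L(y_{t,i}, y_t)\big] \qquad \text{(Hoeffding’s lemma)}$$
$$\le -\eta\, L\Big(\mathop{\mathrm{E}}_{\mathbf{p}_t}[y_{t,i}],\, y_t\Big) + \frac{\eta^2}{8} \qquad \text{(convexity of first arg. of } L\text{)}$$
$$= -\eta\, L(\hat{y}_t, y_t) + \frac{\eta^2}{8}.$$

Summing up these inequalities yields the following upper bound:

$$\Phi_{T+1} - \Phi_1 \le -\eta \sum_{t=1}^{T} L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}. \tag{7.17}$$

We obtain a lower bound for the same quantity as follows:

$$\Phi_{T+1} - \Phi_1 = \log \sum_{i=1}^{N} e^{-\eta L_{T,i}} - \log N \ge \log \max_{i=1}^{N} e^{-\eta L_{T,i}} - \log N = -\eta \min_{i=1}^{N} L_{T,i} - \log N.$$

Combining the upper and lower bounds yields:

$$-\eta \min_{i=1}^{N} L_{T,i} - \log N \le -\eta \sum_{t=1}^{T} L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}$$
$$\Rightarrow\; \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} L_{T,i} \le \frac{\log N}{\eta} + \frac{\eta T}{8},$$

and concludes the proof. ∎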
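
A minimal sketch of the Exponential Weighted Average algorithm follows, with a numerical check of the bound (7.16). The squared loss on $[0, 1]$ is used as an example of a loss that is convex in its first argument and takes values in $[0, 1]$; the synthetic experts (the target plus a hypothetical bias and noise, clipped to $[0, 1]$) are assumptions made for the example.

```python
import math
import random


def squared_loss(prediction, label):
    """Squared loss; it lies in [0, 1] when both arguments lie in [0, 1]."""
    return (prediction - label) ** 2


def exponential_weighted_average(expert_predictions, labels, eta, loss=squared_loss):
    """Run the algorithm of figure 7.5 and return its regret R_T."""
    n = len(expert_predictions[0])
    weights = [1.0] * n
    algorithm_loss = 0.0
    expert_losses = [0.0] * n
    for advice, y in zip(expert_predictions, labels):
        total = sum(weights)
        prediction = sum(w * a for w, a in zip(weights, advice)) / total   # (7.13)
        algorithm_loss += loss(prediction, y)
        for i, a in enumerate(advice):
            expert_loss = loss(a, y)
            expert_losses[i] += expert_loss
            weights[i] *= math.exp(-eta * expert_loss)                     # (7.14)
    return algorithm_loss - min(expert_losses)


rng = random.Random(2)
T, N = 500, 5
labels = [rng.random() for _ in range(T)]
biases = [0.0, 0.1, -0.2, 0.3, 0.5]          # hypothetical expert biases
expert_predictions = [
    [min(1.0, max(0.0, y + b + rng.uniform(-0.1, 0.1))) for b in biases]
    for y in labels
]
eta = math.sqrt(8 * math.log(N) / T)
regret = exponential_weighted_average(expert_predictions, labels, eta)
bound = math.sqrt((T / 2) * math.log(N))
```

Only the current weights are stored, which reflects the remark that the algorithm needs the cumulative performance of each expert rather than its full loss history.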


The optimal choice of $\eta$ in theorem 7.6 requires knowledge of the horizon $T$, which is
an apparent disadvantage of this analysis. However, we can use a standard doubling
trick to eliminate this requirement, at the price of a small constant factor. This
consists of dividing time into periods $[2^k, 2^{k+1} - 1]$ of length $2^k$ with $k = 0, \ldots, n$
and $T \ge 2^n - 1$, and then choosing $\eta_k = \sqrt{\frac{8 \log N}{2^k}}$ in each period. The following theorem
presents a regret bound when using the doubling trick to select $\eta$. A more general
method consists of interpreting $\eta$ as a function of time, i.e., $\eta_t = \sqrt{(8 \log N)/t}$,
which can lead to a further constant factor improvement over the regret bound of
the following theorem.


Theorem 7.7

Assume that the loss function $L$ is convex in its first argument and takes values
in $[0, 1]$. Then, for any $T \ge 1$ and any sequence $y_1, \ldots, y_T \in \mathcal{Y}$, the regret of the
Exponential Weighted Average algorithm after $T$ rounds is bounded as follows:

$$R_T \le \frac{\sqrt{2}}{\sqrt{2} - 1} \sqrt{(T/2) \log N} + \sqrt{\log N / 2}. \tag{7.18}$$


Proof Let $T \ge 1$ and let $I_k = [2^k, 2^{k+1} - 1]$, for $k \in [0, n]$, with $n = \lfloor \log_2(T + 1) \rfloor$.
Let $L_{I_k}$ denote the loss incurred in the interval $I_k$. By theorem 7.6 (7.16), for any
$k \in [0, n]$, we have

$$L_{I_k} - \min_{i=1}^{N} L_{I_k, i} \le \sqrt{(2^k/2) \log N}. \tag{7.19}$$

Thus, we can bound the total loss incurred by the algorithm after $T$ rounds as:

$$L_T = \sum_{k=0}^{n} L_{I_k} \le \sum_{k=0}^{n} \min_{i=1}^{N} L_{I_k, i} + \sum_{k=0}^{n} \sqrt{2^k (\log N)/2}$$
$$\le \min_{i=1}^{N} L_{T,i} + \sqrt{(\log N)/2} \cdot \sum_{k=0}^{n} 2^{k/2}, \tag{7.20}$$

where the second inequality follows from the super-additivity of min, that is
$\min_i X_i + \min_i Y_i \le \min_i (X_i + Y_i)$ for any sequences $(X_i)_i$ and $(Y_i)_i$, which implies
$\sum_{k=0}^{n} \min_{i=1}^{N} L_{I_k, i} \le \min_{i=1}^{N} \sum_{k=0}^{n} L_{I_k, i}$. The geometric sum appearing in the right-hand
side of (7.20) can be expressed as follows:

$$\sum_{k=0}^{n} 2^{k/2} = \frac{2^{(n+1)/2} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2}\sqrt{T+1} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2}(\sqrt{T} + 1) - 1}{\sqrt{2} - 1} = \frac{\sqrt{2}\sqrt{T}}{\sqrt{2} - 1} + 1.$$

Plugging back into (7.20) and rearranging terms yields (7.18). ∎


The $O(\sqrt{T})$ dependency on $T$ presented in this bound cannot be improved for
general loss functions.
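
The doubling trick can be sketched as follows: the Exponential Weighted Average algorithm is restarted at the beginning of each period $[2^k, 2^{k+1} - 1]$ with $\eta_k = \sqrt{(8 \log N)/2^k}$, so the horizon $T$ is never used to set the parameter. The absolute loss, the five constant experts, and the random labels are assumptions made for the example.

```python
import math
import random


def absolute_loss(prediction, label):
    """Absolute loss; convex in its first argument and in [0, 1] on [0, 1]."""
    return abs(prediction - label)


def ewa_with_doubling(expert_predictions, labels, loss=absolute_loss):
    """Exponential Weighted Average with the doubling trick; return the regret R_T."""
    n = len(expert_predictions[0])
    T = len(labels)
    algorithm_loss = 0.0
    expert_losses = [0.0] * n
    start, k = 0, 0       # 0-based index of round 2^k; period k has length 2^k
    while start < T:
        length = 2 ** k
        eta_k = math.sqrt(8 * math.log(n) / length)
        weights = [1.0] * n                          # restart in each period
        for t in range(start, min(start + length, T)):
            advice, y = expert_predictions[t], labels[t]
            total = sum(weights)
            prediction = sum(w * a for w, a in zip(weights, advice)) / total
            algorithm_loss += loss(prediction, y)
            for i, a in enumerate(advice):
                expert_loss = loss(a, y)
                expert_losses[i] += expert_loss
                weights[i] *= math.exp(-eta_k * expert_loss)
        start += length
        k += 1
    return algorithm_loss - min(expert_losses)


rng = random.Random(4)
T, N = 1000, 5
labels = [1 if rng.random() < 0.7 else 0 for _ in range(T)]
constant_experts = [0.0, 0.25, 0.5, 0.75, 1.0]       # hypothetical constant predictors
expert_predictions = [constant_experts for _ in range(T)]
regret = ewa_with_doubling(expert_predictions, labels)
sqrt2 = math.sqrt(2)
bound_7_18 = sqrt2 / (sqrt2 - 1) * math.sqrt((T / 2) * math.log(N)) + math.sqrt(math.log(N) / 2)
```

The last period may be cut short by the end of the sequence; the per-period bound (7.19) still applies to it, since $\eta_k$ is tuned for a length that is at least the number of rounds actually played.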


7.3 Linear classification 


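This section presents two well-known on-line learning algorithms for linear
classification: the Perceptron and Winnow algorithms.


PERCEPTRON($\mathbf{w}_0$)
1  $\mathbf{w}_1 \leftarrow \mathbf{w}_0$        ▷ typically $\mathbf{w}_0 = \mathbf{0}$
2  for $t \leftarrow 1$ to $T$ do
3      RECEIVE($\mathbf{x}_t$)
4      $\hat{y}_t \leftarrow \mathrm{sgn}(\mathbf{w}_t \cdot \mathbf{x}_t)$
5      RECEIVE($y_t$)
6      if ($\hat{y}_t \neq y_t$) then
7          $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \mathbf{x}_t$        ▷ more generally $\eta y_t \mathbf{x}_t$, $\eta > 0$.
8      else $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$
9  return $\mathbf{w}_{T+1}$


Figure 7.6 Perceptron algorithm.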


7.3.1 Perceptron algorithm 


The Perceptron algorithm is one of the earliest machine learning algorithms. It is 
an on-line linear classification algorithm. Thus, it learns a decision function based 
on a hyperplane by processing training points one at a time. Figure 7.6 gives its 
pseudocode. 

The algorithm maintains a weight vector $\mathbf{w}_t \in \mathbb{R}^N$ defining the hyperplane
learned, starting with an arbitrary vector $\mathbf{w}_0$. At each round $t \in [1, T]$, it predicts
the label of the point $\mathbf{x}_t \in \mathbb{R}^N$ received, using the current vector $\mathbf{w}_t$ (line 4). When
the prediction made does not match the correct label (lines 6-7), it updates $\mathbf{w}_t$ by
adding $y_t \mathbf{x}_t$. More generally, when a learning rate $\eta > 0$ is used, the vector added
is $\eta y_t \mathbf{x}_t$. This update can be partially motivated by examining the inner product of
the current weight vector with $y_t \mathbf{x}_t$, whose sign determines the classification of $\mathbf{x}_t$.
Just before an update, $\mathbf{x}_t$ is misclassified and thus $y_t \mathbf{w}_t \cdot \mathbf{x}_t$ is negative; afterward,
$y_t \mathbf{w}_{t+1} \cdot \mathbf{x}_t = y_t \mathbf{w}_t \cdot \mathbf{x}_t + \eta \|\mathbf{x}_t\|^2$. Thus, the update corrects the weight vector in the
direction of making the inner product positive by augmenting it with the quantity
$\eta \|\mathbf{x}_t\|^2 > 0$.

The Perceptron algorithm can be shown in fact to seek a weight vector $\mathbf{w}$
minimizing an objective function $F$ precisely based on the quantities $(-y_t \mathbf{w} \cdot \mathbf{x}_t)$,
$t \in [1, T]$. Since $(-y_t \mathbf{w} \cdot \mathbf{x}_t)$ is positive when $\mathbf{x}_t$ is misclassified by $\mathbf{w}$, $F$ is defined
for all $\mathbf{w} \in \mathbb{R}^N$ by

$$F(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \max\big(0, -y_t (\mathbf{w} \cdot \mathbf{x}_t)\big) = \mathop{\mathrm{E}}_{\mathbf{x} \sim \widehat{D}}\big[\widetilde{F}(\mathbf{w}, \mathbf{x})\big], \tag{7.21}$$

where $\widetilde{F}(\mathbf{w}, \mathbf{x}) = \max\big(0, -f(\mathbf{x})(\mathbf{w} \cdot \mathbf{x})\big)$ with $f(\mathbf{x})$ denoting the label of $\mathbf{x}$, and
$\widehat{D}$ is the empirical distribution associated with the sample $(\mathbf{x}_1, \ldots, \mathbf{x}_T)$. For any
$t \in [1, T]$, $\mathbf{w} \mapsto -y_t (\mathbf{w} \cdot \mathbf{x}_t)$ is linear and thus convex. Since the max operator
preserves convexity, this shows that $F$ is convex. However, $F$ is not differentiable.
Nevertheless, the Perceptron algorithm coincides with the application of the stochastic
gradient descent technique to $F$.


Figure 7.7 An example path followed by the iterative stochastic gradient descent
technique. Each inner contour indicates a region of lower elevation.

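The stochastic (or on-line) gradient descent technique examines one point $\mathbf{x}_t$ at
a time. For a function $\widetilde{F}$, a generalized version of this technique can be defined by
the execution of the following update for each point $\mathbf{x}_t$:

$$\mathbf{w}_{t+1} \leftarrow \begin{cases} \mathbf{w}_t - \eta \nabla_{\mathbf{w}} \widetilde{F}(\mathbf{w}_t, \mathbf{x}_t) & \text{if } \mathbf{w} \mapsto \widetilde{F}(\mathbf{w}, \mathbf{x}_t) \text{ differentiable at } \mathbf{w}_t, \\ \mathbf{w}_t & \text{otherwise,} \end{cases} \tag{7.22}$$

where $\eta > 0$ is a learning rate parameter. Figure 7.7 illustrates an example path
the gradient descent follows. In the specific case we are considering, $\mathbf{w} \mapsto \widetilde{F}(\mathbf{w}, \mathbf{x}_t)$
is differentiable at any $\mathbf{w}$ such that $y_t (\mathbf{w} \cdot \mathbf{x}_t) \neq 0$ with $\nabla_{\mathbf{w}} \widetilde{F}(\mathbf{w}, \mathbf{x}_t) = -y_t \mathbf{x}_t$ if
$y_t (\mathbf{w} \cdot \mathbf{x}_t) < 0$ and $\nabla_{\mathbf{w}} \widetilde{F}(\mathbf{w}, \mathbf{x}_t) = \mathbf{0}$ if $y_t (\mathbf{w} \cdot \mathbf{x}_t) > 0$. Thus, the stochastic gradient
descent update becomes

$$\mathbf{w}_{t+1} \leftarrow \begin{cases} \mathbf{w}_t + \eta y_t \mathbf{x}_t & \text{if } y_t (\mathbf{w}_t \cdot \mathbf{x}_t) < 0; \\ \mathbf{w}_t & \text{if } y_t (\mathbf{w}_t \cdot \mathbf{x}_t) > 0; \\ \mathbf{w}_t & \text{otherwise,} \end{cases} \tag{7.23}$$

which coincides exactly with the update of the Perceptron algorithm.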
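
The correspondence between the generalized stochastic gradient descent update (7.22) and the Perceptron update can be checked on a few points with the following sketch. The function names are hypothetical. The Perceptron step below updates only when $y_t(\mathbf{w}_t \cdot \mathbf{x}_t) < 0$; whether a zero inner product also triggers an update depends on the convention adopted for $\mathrm{sgn}(0)$, which this sketch leaves aside.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def gradient(w, x, y):
    """Gradient of w -> max(0, -y (w . x)), or None where it is not differentiable."""
    margin = y * dot(w, x)
    if margin < 0:
        return [-y * xi for xi in x]
    if margin > 0:
        return [0.0] * len(x)
    return None


def sgd_step(w, x, y, eta=1.0):
    """Generalized stochastic gradient descent update (7.22)."""
    g = gradient(w, x, y)
    if g is None:
        return list(w)
    return [wi - eta * gi for wi, gi in zip(w, g)]


def perceptron_step(w, x, y, eta=1.0):
    """Perceptron update with learning rate eta, applied when y (w . x) < 0."""
    if y * dot(w, x) < 0:
        return [wi + eta * y * xi for wi, xi in zip(w, x)]
    return list(w)


# Hypothetical test points covering the three cases of (7.23).
cases = [
    ([1.0, -2.0], [0.0, 1.0], 1),    # y (w . x) = -2 < 0: update
    ([1.0, -2.0], [1.0, 0.0], 1),    # y (w . x) = 1 > 0: no update
    ([1.0, -2.0], [2.0, 1.0], -1),   # y (w . x) = 0: no update
]
results = [(sgd_step(w, x, y, 0.5), perceptron_step(w, x, y, 0.5)) for w, x, y in cases]
```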


The following theorem gives a margin-based upper bound on the number of
mistakes or updates made by the Perceptron algorithm when processing a sequence
of $T$ points that can be linearly separated by a hyperplane with margin $\rho > 0$.


Theorem 7.8

Let $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|\mathbf{x}_t\| \le r$ for all $t \in [1, T]$, for
some $r > 0$. Assume that there exist $\rho > 0$ and $\mathbf{v} \in \mathbb{R}^N$ such that for all $t \in [1, T]$,
$\rho \le \frac{y_t (\mathbf{v} \cdot \mathbf{x}_t)}{\|\mathbf{v}\|}$. Then, the number of updates made by the Perceptron algorithm when
processing $\mathbf{x}_1, \ldots, \mathbf{x}_T$ is bounded by $r^2/\rho^2$.


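Proof Let $I$ be the subset of the $T$ rounds at which there is an update, and let
$M$ be the total number of updates, i.e., $|I| = M$. Summing up the assumption
inequalities yields:

$$M\rho \le \frac{\mathbf{v} \cdot \sum_{t \in I} y_t \mathbf{x}_t}{\|\mathbf{v}\|} \le \Big\| \sum_{t \in I} y_t \mathbf{x}_t \Big\| \qquad \text{(Cauchy-Schwarz inequality)}$$
$$= \Big\| \sum_{t \in I} (\mathbf{w}_{t+1} - \mathbf{w}_t) \Big\| \qquad \text{(definition of updates)}$$
$$= \|\mathbf{w}_{T+1}\| \qquad \text{(telescoping sum, } \mathbf{w}_0 = \mathbf{0}\text{)}$$
$$= \sqrt{\sum_{t \in I} \|\mathbf{w}_{t+1}\|^2 - \|\mathbf{w}_t\|^2} \qquad \text{(telescoping sum, } \mathbf{w}_0 = \mathbf{0}\text{)}$$
$$= \sqrt{\sum_{t \in I} \|\mathbf{w}_t + y_t \mathbf{x}_t\|^2 - \|\mathbf{w}_t\|^2} \qquad \text{(definition of updates)}$$
$$= \sqrt{\sum_{t \in I} \underbrace{2 y_t \mathbf{w}_t \cdot \mathbf{x}_t}_{\le 0} + \|\mathbf{x}_t\|^2}$$
$$\le \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2} \le \sqrt{M r^2}.$$

Comparing the left- and right-hand sides gives $\sqrt{M} \le r/\rho$, that is, $M \le r^2/\rho^2$. ∎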


By definition of the algorithm, the weight vector $\mathbf{w}_T$ after processing $T$ points is a
linear combination of the vectors $\mathbf{x}_t$ at which an update was made: $\mathbf{w}_T = \sum_{t \in I} y_t \mathbf{x}_t$.
Thus, as in the case of SVMs, these vectors can be referred to as support vectors
for the Perceptron algorithm.

The bound of theorem 7.8 is remarkable, since it depends only on the normalized
margin $\rho/r$ and not on the dimension $N$ of the space. This bound can be shown
to be tight, that is, the number of updates can be equal to $r^2/\rho^2$ in some instances
(see exercise 7.3 to show the upper bound is tight).
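
The margin bound of theorem 7.8 can be checked numerically with a minimal sketch of the Perceptron algorithm of figure 7.6 (with $\mathbf{w}_0 = \mathbf{0}$ and $\eta = 1$). The two-dimensional sample, the separator `v`, and the enforced margin of 0.1 are assumptions made for the example. A zero inner product is counted as a mistake here, which corresponds to the convention $\mathrm{sgn}(0) = 0$.

```python
import math
import random


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def perceptron(points, labels, passes=1):
    """Run the Perceptron algorithm; return (number of updates, final weight vector)."""
    w = [0.0] * len(points[0])                 # w_0 = 0
    updates = 0
    for _ in range(passes):
        for x, y in zip(points, labels):
            if y * dot(w, x) <= 0:             # prediction does not match the label
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
    return updates, w


rng = random.Random(3)
v = [1 / math.sqrt(2), 1 / math.sqrt(2)]       # hypothetical separating direction
points, labels = [], []
while len(points) < 200:
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if abs(dot(v, x)) >= 0.1:                  # keep only points with margin >= 0.1
        points.append(x)
        labels.append(1 if dot(v, x) > 0 else -1)

norm_v = math.sqrt(dot(v, v))
rho = min(y * dot(v, x) / norm_v for x, y in zip(points, labels))
r = max(math.sqrt(dot(x, x)) for x in points)
updates, w = perceptron(points, labels, passes=5)
bound = (r / rho) ** 2
```

Several passes over the same sample form a longer sequence with the same $r$ and $\rho$, so the bound $r^2/\rho^2$ applies to the total number of updates across all passes.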

The theorem required no assumption about the sequence of points $\mathbf{x}_1, \ldots, \mathbf{x}_T$.
A standard setting for the application of the Perceptron algorithm is one where a
finite sample $S$ of size $m < T$ is available and where the algorithm makes multiple
passes over these $m$ points. The result of the theorem implies that when $S$ is linearly
separable, the Perceptron algorithm converges after a finite number of updates and
thus passes. For a small margin $\rho$, the convergence of the algorithm can be quite
slow, however. In fact, for some samples, regardless of the order in which the points
in $S$ are processed, the number of updates made by the algorithm is in $\Omega(2^N)$ (see
exercise 7.1). Of course, if $S$ is not linearly separable, the Perceptron algorithm does
not converge. In practice, it is stopped after some number of passes over $S$.

There are many variants of the standard Perceptron algorithm which are used
in practice and have been theoretically analyzed. One notable example is the voted
Perceptron algorithm, which predicts according to the rule $\mathrm{sgn}\big(\big(\sum_{t \in I} c_t \mathbf{w}_t\big) \cdot \mathbf{x}\big)$,
where $c_t$ is a weight proportional to the number of iterations that $\mathbf{w}_t$ survives, i.e.,
the number of iterations between $\mathbf{w}_t$ and $\mathbf{w}_{t+1}$.

For the following theorem, we consider the case where the Perceptron algorithm
is trained via multiple passes until convergence over a finite sample that is linearly
separable. In view of theorem 7.8, convergence occurs after a finite number of
updates.

For a linearly separable sample $S$, we denote by $r_S$ the radius of the smallest
sphere containing all points in $S$ and by $\rho_S$ the largest margin of a separating
hyperplane for $S$. We also denote by $M(S)$ the number of updates made by the
algorithm after training over $S$.


Theorem 7.9

Assume that the data is linearly separable. Let $h_S$ be the hypothesis returned by the
Perceptron algorithm after training over a sample $S$ of size $m$ drawn according to
some distribution $D$. Then, the expected error of $h_S$ is bounded as follows:

$$\mathop{\mathrm{E}}_{S \sim D^m}\big[R(h_S)\big] \le \mathop{\mathrm{E}}_{S \sim D^{m+1}}\Big[\frac{\min\big(M(S),\, r_S^2/\rho_S^2\big)}{m+1}\Big].$$


Proof Let $S$ be a linearly separable sample of size $m + 1$ drawn i.i.d. according
to $D$ and let $\mathbf{x}$ be a point in $S$. If $h_{S - \{\mathbf{x}\}}$ misclassifies $\mathbf{x}$, then $\mathbf{x}$ must be a support
vector for $h_S$. Thus, the leave-one-out error of the Perceptron algorithm on sample
$S$ is at most $\frac{M(S)}{m+1}$. The result then follows from lemma 4.1, which relates the expected
leave-one-out error to the expected error, along with the upper bound on $M(S)$
given by theorem 7.8. ∎


This result can be compared with a similar one given for the SVM algorithm (with
no offset) in the following theorem, which is an extension of theorem 4.1. We denote
by $N_{SV}(S)$ the number of support vectors that define the hypothesis $h_S$ returned
by SVMs when trained on a sample $S$.


Theorem 7.10

Assume that the data is linearly separable. Let $h_S$ be the hypothesis returned by
SVMs used with no offset ($b = 0$) after training over a sample $S$ of size $m$ drawn
according to some distribution $D$. Then, the expected error of $h_S$ is bounded as
follows:

$$\mathop{\mathrm{E}}_{S \sim D^m}\big[R(h_S)\big] \le \mathop{\mathrm{E}}_{S \sim D^{m+1}}\Big[\frac{\min\big(N_{SV}(S),\, r_S^2/\rho_S^2\big)}{m+1}\Big].$$


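Proof The fact that the expected error can be upper bounded by the average
fraction of support vectors ($N_{SV}(S)/(m + 1)$) was already shown by theorem 4.1.
Thus, it suffices to show that it is also upper bounded by the expected value of
$(r_S^2/\rho_S^2)/(m + 1)$. To do so, we will bound the leave-one-out error of the SVM
algorithm for a sample $S$ of size $m + 1$ by $(r_S^2/\rho_S^2)/(m + 1)$. The result will then
follow by lemma 4.1, which relates the expected leave-one-out error to the expected
error.

Let $S = (\mathbf{x}_1, \ldots, \mathbf{x}_{m+1})$ be a linearly separable sample drawn i.i.d. according to
$D$ and let $\mathbf{x}$ be a point in $S$ that is misclassified by $h_{S - \{\mathbf{x}\}}$. We will analyze the
case where $\mathbf{x} = \mathbf{x}_{m+1}$; the analysis of other cases is similar. We denote by $S'$ the
sample $(\mathbf{x}_1, \ldots, \mathbf{x}_m)$.

For any $q \in [1, m + 1]$, let $G_q$ denote the function defined over $\mathbb{R}^q$ by
$G_q\colon \boldsymbol{\alpha} \mapsto \sum_{i=1}^{q} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{q} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$. Then, $G_{m+1}$ is the objective function of the
dual optimization problem for SVMs associated to the sample $S$ and $G_m$ the one
for the sample $S'$. Let $\boldsymbol{\alpha} \in \mathbb{R}^{m+1}$ denote a solution of the dual SVM problem
$\max_{\boldsymbol{\alpha} \ge 0} G_{m+1}(\boldsymbol{\alpha})$ and $\boldsymbol{\alpha}' \in \mathbb{R}^{m+1}$ the vector such that $(\alpha'_1, \ldots, \alpha'_m)^\top \in \mathbb{R}^m$ is a
solution of $\max_{\boldsymbol{\alpha} \ge 0} G_m(\boldsymbol{\alpha})$ and $\alpha'_{m+1} = 0$. Let $\mathbf{e}_{m+1}$ denote the $(m + 1)$th unit
vector in $\mathbb{R}^{m+1}$. By definition of $\boldsymbol{\alpha}$ and $\boldsymbol{\alpha}'$ as maximizers, $\max_{\beta \ge 0} G_{m+1}(\boldsymbol{\alpha}' +
\beta \mathbf{e}_{m+1}) \le G_{m+1}(\boldsymbol{\alpha})$ and $G_{m+1}(\boldsymbol{\alpha} - \alpha_{m+1} \mathbf{e}_{m+1}) \le G_m(\boldsymbol{\alpha}')$. Thus, the quantity
$A = G_{m+1}(\boldsymbol{\alpha}) - G_m(\boldsymbol{\alpha}')$ admits the following lower and upper bounds:

$$\max_{\beta \ge 0} G_{m+1}(\boldsymbol{\alpha}' + \beta \mathbf{e}_{m+1}) - G_m(\boldsymbol{\alpha}') \le A \le G_{m+1}(\boldsymbol{\alpha}) - G_{m+1}(\boldsymbol{\alpha} - \alpha_{m+1} \mathbf{e}_{m+1}).$$

Let $\mathbf{w} = \sum_{i=1}^{m+1} y_i \alpha_i \mathbf{x}_i$ denote the weight vector returned by SVMs for the sample
$S$. Since $h_{S'}$ misclassifies $\mathbf{x}_{m+1}$, $\mathbf{x}_{m+1}$ must be a support vector for $h_S$, thus
$y_{m+1} \mathbf{w} \cdot \mathbf{x}_{m+1} = 1$. In view of that, the upper bound can be rewritten as follows:

$$G_{m+1}(\boldsymbol{\alpha}) - G_{m+1}(\boldsymbol{\alpha} - \alpha_{m+1} \mathbf{e}_{m+1})$$
$$= \alpha_{m+1} - \sum_{i=1}^{m+1} (y_i \alpha_i \mathbf{x}_i) \cdot (y_{m+1} \alpha_{m+1} \mathbf{x}_{m+1}) + \frac{1}{2} \alpha_{m+1}^2 \|\mathbf{x}_{m+1}\|^2$$
$$= \alpha_{m+1} \big(1 - y_{m+1} \mathbf{w} \cdot \mathbf{x}_{m+1}\big) + \frac{1}{2} \alpha_{m+1}^2 \|\mathbf{x}_{m+1}\|^2$$
$$= \frac{1}{2} \alpha_{m+1}^2 \|\mathbf{x}_{m+1}\|^2.$$

Similarly, let $\mathbf{w}' = \sum_{i=1}^{m} y_i \alpha'_i \mathbf{x}_i$. Then, for any $\beta \ge 0$, the quantity maximized in
the lower bound can be written as

$$G_{m+1}(\boldsymbol{\alpha}' + \beta \mathbf{e}_{m+1}) - G_m(\boldsymbol{\alpha}')$$
$$= \beta \big(1 - y_{m+1} (\mathbf{w}' + \beta y_{m+1} \mathbf{x}_{m+1}) \cdot \mathbf{x}_{m+1}\big) + \frac{1}{2} \beta^2 \|\mathbf{x}_{m+1}\|^2$$
$$= \beta \big(1 - y_{m+1} \mathbf{w}' \cdot \mathbf{x}_{m+1}\big) - \frac{1}{2} \beta^2 \|\mathbf{x}_{m+1}\|^2.$$

The right-hand side is maximized for the following value of $\beta$: $\frac{1 - y_{m+1} \mathbf{w}' \cdot \mathbf{x}_{m+1}}{\|\mathbf{x}_{m+1}\|^2}$.
Plugging in this value in the right-hand side gives $\frac{1}{2} \frac{(1 - y_{m+1} \mathbf{w}' \cdot \mathbf{x}_{m+1})^2}{\|\mathbf{x}_{m+1}\|^2}$. Thus,

$$A \ge \frac{1}{2} \frac{(1 - y_{m+1} \mathbf{w}' \cdot \mathbf{x}_{m+1})^2}{\|\mathbf{x}_{m+1}\|^2} \ge \frac{1}{2 \|\mathbf{x}_{m+1}\|^2},$$

using the fact that $y_{m+1} \mathbf{w}' \cdot \mathbf{x}_{m+1} < 0$, since $\mathbf{x}_{m+1}$ is misclassified by $\mathbf{w}'$. Comparing
this lower bound on $A$ with the upper bound previously derived leads to $\frac{1}{2 \|\mathbf{x}_{m+1}\|^2} \le
\frac{1}{2} \alpha_{m+1}^2 \|\mathbf{x}_{m+1}\|^2$, that is

$$\alpha_{m+1} \ge \frac{1}{\|\mathbf{x}_{m+1}\|^2} \ge \frac{1}{r_S^2}.$$

The analysis carried out in the case $\mathbf{x} = \mathbf{x}_{m+1}$ holds similarly for any $\mathbf{x}_i$ in $S$ that is
misclassified by $h_{S - \{\mathbf{x}_i\}}$. Let $I$ denote the set of such indices $i$. Then, we can write:

$$\sum_{i \in I} \alpha_i \ge \frac{|I|}{r_S^2}.$$

By (4.18), the following simple expression holds for the margin: $\sum_{i=1}^{m+1} \alpha_i = 1/\rho_S^2$.
Using this identity leads to

$$|I| \le r_S^2 \sum_{i \in I} \alpha_i \le r_S^2 \sum_{i=1}^{m+1} \alpha_i = \frac{r_S^2}{\rho_S^2}.$$

Since by definition $|I|$ is the total number of leave-one-out errors, this concludes the
proof. ∎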


Thus, the guarantees given by theorem 7.9 and theorem 7.10 in the separable
case have a similar form. These bounds do not seem sufficient to distinguish the
effectiveness of the SVM and Perceptron algorithms. Note, however, that while the
same margin quantity $\rho_S$ appears in both bounds, the radius $r_S$ can be replaced by
a finer quantity that is different for the two algorithms: in both cases, instead of the
radius of the sphere containing all sample points, $r_S$ can be replaced by the radius
of the sphere containing the support vectors, as can be seen straightforwardly from
the proof of the theorems. Thus, the position of the support vectors in the case
of SVMs can provide a more favorable guarantee than that of the support vectors
(update vectors) for the Perceptron algorithm. Finally, the guarantees given by
these theorems are somewhat weak: they are not high-probability bounds. They
hold only for the expected error of the hypotheses returned by the algorithms and,
in particular, provide no information about the variance of their error.

The following theorem presents a bound on the number of updates or mistakes 
made by the Perceptron algorithm in the more general scenario of a non-linearly 
separable sample. 


Theorem 7.11 

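Let $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|\mathbf{x}_t\| \leq r$ for all $t \in [1, T]$, for
some $r > 0$. Let $\mathbf{v} \in \mathbb{R}^N$ be any vector with $\|\mathbf{v}\| = 1$ and let $\rho > 0$. Define the
deviation of $\mathbf{x}_t$ by $d_t = \max\{0, \rho - y_t (\mathbf{v} \cdot \mathbf{x}_t)\}$, and let $\delta = \sqrt{\sum_{t=1}^{T} d_t^2}$. Then, the
number of updates made by the Perceptron algorithm when processing $\mathbf{x}_1, \ldots, \mathbf{x}_T$ is
bounded by $(r + \delta)^2 / \rho^2$.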


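Proof We first reduce the problem to the separable case by mapping each input
vector $\mathbf{x}_t \in \mathbb{R}^N$ to a vector $\mathbf{x}'_t \in \mathbb{R}^{N+T}$ as follows:

\[
\mathbf{x}_t = (x_{t,1}, \ldots, x_{t,N})^\top
\;\mapsto\;
\mathbf{x}'_t = (x_{t,1}, \ldots, x_{t,N}, 0, \ldots, 0, \underbrace{\Delta}_{(N+t)\text{th component}}, 0, \ldots, 0)^\top ,
\]

where the first $N$ components of $\mathbf{x}'_t$ are identical to those of $\mathbf{x}_t$ and the only other
non-zero component is the $(N + t)$th component and is equal to $\Delta$. The value of
the parameter $\Delta$ will be set later. The vector $\mathbf{v}$ is replaced by the vector $\mathbf{v}'$ defined
as follows:

\[
\mathbf{v}' = \Big( \frac{v_1}{Z}, \ldots, \frac{v_N}{Z}, \frac{y_1 d_1}{\Delta Z}, \ldots, \frac{y_T d_T}{\Delta Z} \Big)^\top .
\]

The first $N$ components of $\mathbf{v}'$ are equal to the components of $\mathbf{v}/Z$ and the remaining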
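
DUALPERCEPTRON($\boldsymbol{\alpha}_0$)
    $\boldsymbol{\alpha} \leftarrow \boldsymbol{\alpha}_0$    ▷ typically $\boldsymbol{\alpha}_0 = \mathbf{0}$
    for $t \leftarrow 1$ to $T$ do
        RECEIVE($\mathbf{x}_t$)
        $\hat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s (\mathbf{x}_s \cdot \mathbf{x}_t)\big)$
        RECEIVE($y_t$)
        if ($\hat{y}_t \neq y_t$) then
            $\alpha_t \leftarrow \alpha_t + 1$
    return $\boldsymbol{\alpha}$


Figure 7.8 Dual Perceptron algorithm.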


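$T$ components are functions of the labels and deviations. $Z$ is chosen to guarantee
that $\|\mathbf{v}'\| = 1$: $Z = \sqrt{1 + \delta^2/\Delta^2}$. The predictions made by the Perceptron algorithm
for $\mathbf{x}'_t$, $t \in [1, T]$, coincide with those made in the original space for $\mathbf{x}_t$, $t \in [1, T]$.
Furthermore, by definition of $\mathbf{v}'$ and $\mathbf{x}'_t$, we can write for any $t \in [1, T]$:

\[
\begin{aligned}
y_t (\mathbf{v}' \cdot \mathbf{x}'_t)
&= y_t \Big( \frac{\mathbf{v} \cdot \mathbf{x}_t}{Z} + \Delta \frac{y_t d_t}{\Delta Z} \Big) \\
&= \frac{y_t \mathbf{v} \cdot \mathbf{x}_t}{Z} + \frac{d_t}{Z} \\
&\geq \frac{y_t \mathbf{v} \cdot \mathbf{x}_t + \rho - y_t (\mathbf{v} \cdot \mathbf{x}_t)}{Z} = \frac{\rho}{Z},
\end{aligned}
\]

where the inequality results from the definition of the deviation $d_t$. This shows that
the sample formed by $\mathbf{x}'_1, \ldots, \mathbf{x}'_T$ is linearly separable with margin $\rho/Z$. Thus, in
view of theorem 7.8, since $\|\mathbf{x}'_t\|^2 \leq r^2 + \Delta^2$, the number of updates made by the
Perceptron algorithm is bounded by $\frac{(r^2 + \Delta^2)(1 + \delta^2/\Delta^2)}{\rho^2}$. Choosing $\Delta^2$ to minimize
this bound leads to $\Delta^2 = r\delta$. Plugging in this value yields the statement of the
theorem. ∎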
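
The bound can be checked numerically. The following Python sketch is only an illustration and is not part of the original text: the function names `perceptron_updates` and `deviation_bound` are hypothetical, the sample is synthetic, and a zero score is counted as a mistake, which is an assumption about tie-breaking. It runs a single pass of the Perceptron algorithm and evaluates $(r + \delta)^2/\rho^2$ for a chosen unit vector and margin.

```python
import math
import random


def perceptron_updates(points, labels):
    """Run one pass of the Perceptron algorithm and count its updates."""
    w = [0.0] * len(points[0])
    updates = 0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:  # mistake (assumption: a zero score counts as one)
            w = [wi + y * xi for wi, xi in zip(w, x)]
            updates += 1
    return updates


def deviation_bound(points, labels, v, rho):
    """Return (r + delta)^2 / rho^2 for a unit vector v and a margin rho > 0."""
    r = max(math.sqrt(sum(xi * xi for xi in x)) for x in points)
    deviations = [
        max(0.0, rho - y * sum(vi * xi for vi, xi in zip(v, x)))
        for x, y in zip(points, labels)
    ]
    delta = math.sqrt(sum(d * d for d in deviations))
    return (r + delta) ** 2 / rho ** 2


# Usage on a small, noisy (hence not linearly separable) synthetic sample.
random.seed(0)
points = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
labels = [1 if p[0] + random.gauss(0, 0.3) > 0 else -1 for p in points]
updates = perceptron_updates(points, labels)
bound = deviation_bound(points, labels, v=[1.0, 0.0], rho=0.5)
print(updates, "<=", bound)
```

Any unit vector and any positive margin give a valid bound, so in practice one would report the smallest value over a few candidate pairs.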


The main idea behind the proof of the theorem just presented is to map input points 
to a higher-dimensional space where linear separation is possible, which coincides 
with the idea of kernel methods. In fact, the particular kernel used in the proof is 
close to a straightforward one with a feature mapping that maps each data point 
to a distinct dimension. 

The Perceptron algorithm can in fact be generalized, as in the case of SVMs, 
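
KERNELPERCEPTRON($\boldsymbol{\alpha}_0$)
    $\boldsymbol{\alpha} \leftarrow \boldsymbol{\alpha}_0$    ▷ typically $\boldsymbol{\alpha}_0 = \mathbf{0}$
    for $t \leftarrow 1$ to $T$ do
        RECEIVE($x_t$)
        $\hat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s K(x_s, x_t)\big)$
        RECEIVE($y_t$)
        if ($\hat{y}_t \neq y_t$) then
            $\alpha_t \leftarrow \alpha_t + 1$
    return $\boldsymbol{\alpha}$


Figure 7.9 Kernel Perceptron algorithm for PDS kernel $K$.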


to define a linear separation in a high-dimensional space. It admits an equivalent
dual form, the dual Perceptron algorithm, which is presented in figure 7.8. The
dual Perceptron algorithm maintains a vector $\boldsymbol{\alpha} \in \mathbb{R}^T$ of coefficients assigned to
each point $\mathbf{x}_t$, $t \in [1, T]$. The label of a point $\mathbf{x}_t$ is predicted according to the rule
$\mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}_t)$, where $\mathbf{w} = \sum_{s=1}^{T} \alpha_s y_s \mathbf{x}_s$. The coefficient $\alpha_t$ is incremented by one when
this prediction does not match the correct label. Thus, an update for $\mathbf{x}_t$ is equivalent
to augmenting the weight vector $\mathbf{w}$ with $y_t \mathbf{x}_t$, which shows that the dual algorithm
matches exactly the standard Perceptron algorithm. The dual Perceptron algorithm
can be written solely in terms of inner products between training instances. Thus, as
in the case of SVMs, instead of the inner product between points in the input space,
an arbitrary PDS kernel can be used, which leads to the kernel Perceptron algorithm
detailed in figure 7.9. The kernel Perceptron algorithm and its average variant,
i.e., voted Perceptron with uniform weights $c_t$, are commonly used algorithms in a
variety of applications.
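
As an illustration, the kernel Perceptron can be sketched in a few lines of Python. This is a minimal sketch, not the book's implementation: the names `polynomial_kernel`, `kernel_perceptron` and `predict` are hypothetical, the `epochs` parameter (repeated passes over the same sample) is an added assumption, and $\mathrm{sgn}(0)$ is arbitrarily taken to be $-1$. The usage example is the XOR labeling with a second-degree polynomial kernel.

```python
def polynomial_kernel(x, z, degree=2):
    """PDS polynomial kernel K(x, z) = (x . z + 1)^degree."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** degree


def kernel_perceptron(points, labels, kernel, epochs=1):
    """Return the dual coefficients alpha after `epochs` passes over the sample."""
    alpha = [0] * len(points)
    for _ in range(epochs):
        for t, (x_t, y_t) in enumerate(zip(points, labels)):
            # Prediction uses only kernel evaluations between sample points.
            score = sum(
                alpha[s] * labels[s] * kernel(points[s], x_t)
                for s in range(len(points))
                if alpha[s] != 0
            )
            y_hat = 1 if score > 0 else -1  # assumption: sgn(0) = -1
            if y_hat != y_t:
                alpha[t] += 1  # one more update charged to point t
    return alpha


def predict(alpha, points, labels, kernel, x):
    """Predict the label of x from the dual coefficients."""
    score = sum(a * y * kernel(p, x) for a, y, p in zip(alpha, labels, points))
    return 1 if score > 0 else -1


# Usage: XOR-style labels, which no linear separator in the input space realizes.
xor_points = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
xor_labels = [-1, 1, 1, -1]
alpha = kernel_perceptron(xor_points, xor_labels, polynomial_kernel, epochs=10)
print([predict(alpha, xor_points, xor_labels, polynomial_kernel, x) for x in xor_points])
```

Storing one coefficient per training point, rather than a weight vector, is what allows the feature space to remain implicit.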


7.3.2 Winnow algorithm 


This section presents an alternative on-line linear classification algorithm, the 
Winnow algorithm. Thus, it learns a weight vector defining a separating hyperplane 
by sequentially processing the training points. As suggested by the name, the 
algorithm is particularly well suited to cases where a relatively small number of 
dimensions or experts can be used to define an accurate weight vector. Many of the 
other dimensions may then be irrelevant. 
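
WINNOW($\eta$)
 1  $\mathbf{w}_1 \leftarrow \mathbf{1}/N$
 2  for $t \leftarrow 1$ to $T$ do
 3      RECEIVE($\mathbf{x}_t$)
 4      $\hat{y}_t \leftarrow \mathrm{sgn}(\mathbf{w}_t \cdot \mathbf{x}_t)$
 5      RECEIVE($y_t$)
 6      if ($\hat{y}_t \neq y_t$) then
 7          $Z_t \leftarrow \sum_{i=1}^{N} w_{t,i} \exp(\eta y_t x_{t,i})$
 8          for $i \leftarrow 1$ to $N$ do
 9              $w_{t+1,i} \leftarrow \frac{w_{t,i} \exp(\eta y_t x_{t,i})}{Z_t}$
10      else $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$
11  return $\mathbf{w}_{T+1}$


Figure 7.10 Winnow algorithm, with $y_t \in \{-1, +1\}$ for all $t \in [1, T]$.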


The Winnow algorithm is similar to the Perceptron algorithm, but, instead of
the additive update of the weight vector in the Perceptron case, Winnow's update
is multiplicative. The pseudocode of the algorithm is given in figure 7.10. The
algorithm takes as input a learning parameter $\eta > 0$. It maintains a non-negative
weight vector $\mathbf{w}_t$ with components summing to one ($\|\mathbf{w}_t\|_1 = 1$) starting with the
uniform weight vector (line 1). At each round $t \in [1, T]$, if the prediction does not
match the correct label (line 6), each component $w_{t,i}$, $i \in [1, N]$, is updated by
multiplying it by $\exp(\eta y_t x_{t,i})$ and dividing by the normalization factor $Z_t$, to ensure
that the weights sum to one (lines 7-9). Thus, if the label $y_t$ and $x_{t,i}$ share the same
sign, then $w_{t,i}$ is increased, while, in the opposite case, it is significantly decreased.

The Winnow algorithm is closely related to the WM algorithm: when $x_{t,i} \in
\{-1, +1\}$, $\mathrm{sgn}(\mathbf{w}_t \cdot \mathbf{x}_t)$ coincides with the majority vote, since multiplying the weight
of correct or incorrect experts by $e^{\eta}$ or $e^{-\eta}$ is equivalent to multiplying the weight of
incorrect ones by $\beta = e^{-2\eta}$. The multiplicative update rule of Winnow is of course
also similar to that of AdaBoost.
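
The multiplicative update can be sketched as follows. This Python fragment is a hedged illustration rather than a reference implementation: the function name `winnow` is hypothetical, $\mathrm{sgn}(0)$ is taken to be $-1$ by assumption, and the usage example uses a synthetic sample whose labels are given by the first coordinate.

```python
import itertools
import math


def winnow(points, labels, eta):
    """One pass of Winnow; returns the final weights and the number of updates."""
    n = len(points[0])
    w = [1.0 / n] * n  # uniform initial weight vector
    updates = 0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score > 0 else -1  # assumption: sgn(0) = -1
        if y_hat != y:
            # Multiplicative update followed by normalization by Z_t.
            unnormalized = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
            z = sum(unnormalized)
            w = [u / z for u in unnormalized]
            updates += 1
    return w, updates


# Usage: all points of {-1, +1}^4, labeled by their first coordinate.
cube = [list(x) for x in itertools.product([-1, 1], repeat=4)]
cube_labels = [x[0] for x in cube]
weights, n_updates = winnow(cube, cube_labels, eta=1.0)
print(weights, n_updates)
```

After the run, most of the weight mass sits on the first coordinate, the only relevant one in this example.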

The following theorem gives a mistake bound for the Winnow algorithm in 
the separable case, which is similar in form to the bound of theorem 7.8 for the 
Perceptron algorithm. 
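Theorem 7.12

Let $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|\mathbf{x}_t\|_\infty \leq r_\infty$ for all $t \in [1, T]$,
for some $r_\infty > 0$. Assume that there exist $\mathbf{v} \in \mathbb{R}^N$, $\mathbf{v} \geq 0$, and $\rho_\infty > 0$ such that for
all $t \in [1, T]$, $\rho_\infty \leq \frac{y_t (\mathbf{v} \cdot \mathbf{x}_t)}{\|\mathbf{v}\|_1}$. Then, for $\eta = \frac{\rho_\infty}{r_\infty^2}$, the number of updates made by the
Winnow algorithm when processing $\mathbf{x}_1, \ldots, \mathbf{x}_T$ is upper bounded by $2 (r_\infty^2/\rho_\infty^2) \log N$.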


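Proof Let $I \subseteq \{1, \ldots, T\}$ be the set of iterations at which there is an update,
and let $M$ be the total number of updates, i.e., $|I| = M$. The potential function $\Phi_t$,
$t \in [1, T]$, used for this proof is the relative entropy of the distribution defined by the
normalized weights $v_i/\|\mathbf{v}\|_1 \geq 0$, $i \in [1, N]$, and the one defined by the components
of the weight vector $w_{t,i}$, $i \in [1, N]$:

\[
\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|\mathbf{v}\|_1} \log \frac{v_i/\|\mathbf{v}\|_1}{w_{t,i}} .
\]

To derive an upper bound on $\Phi_t$, we analyze the difference of the potential functions
at two consecutive rounds. For all $t \in I$, this difference can be expressed and
bounded as follows:

\[
\begin{aligned}
\Phi_{t+1} - \Phi_t
&= \sum_{i=1}^{N} \frac{v_i}{\|\mathbf{v}\|_1} \log \frac{w_{t,i}}{w_{t+1,i}} \\
&= \sum_{i=1}^{N} \frac{v_i}{\|\mathbf{v}\|_1} \log \frac{Z_t}{\exp(\eta y_t x_{t,i})} \\
&= \log Z_t - \eta \sum_{i=1}^{N} \frac{v_i}{\|\mathbf{v}\|_1} y_t x_{t,i} \\
&\leq \log \Big[ \sum_{i=1}^{N} w_{t,i} \exp(\eta y_t x_{t,i}) \Big] - \eta \rho_\infty \\
&= \log \mathop{\mathbb{E}}_{i \sim \mathbf{w}_t} \big[ \exp(\eta y_t x_{t,i}) \big] - \eta \rho_\infty \\
&\leq \log \big[ \exp(\eta^2 (2 r_\infty)^2 / 8) \big] + \eta y_t (\mathbf{w}_t \cdot \mathbf{x}_t) - \eta \rho_\infty \\
&\leq \eta^2 r_\infty^2 / 2 - \eta \rho_\infty .
\end{aligned}
\]

The first inequality follows from the definition of $\rho_\infty$. The subsequent equality rewrites
the summation as an expectation over the distribution defined by $\mathbf{w}_t$. The next
inequality uses Hoeffding's lemma (lemma D.1), applied to the centered variable
$\eta y_t x_{t,i} - \eta y_t (\mathbf{w}_t \cdot \mathbf{x}_t)$, which takes values in an interval of length $2 \eta r_\infty$. The last
inequality holds since $y_t (\mathbf{w}_t \cdot \mathbf{x}_t) \leq 0$ at every round at which an update is made.
Summing up these inequalities over all $t \in I$ yields:

\[
\Phi_{T+1} - \Phi_1 \leq M (\eta^2 r_\infty^2 / 2 - \eta \rho_\infty) .
\]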
Next, we derive a lower bound by noting that

\[
\Phi_1 = \sum_{i=1}^{N} \frac{v_i}{\|\mathbf{v}\|_1} \log \frac{v_i/\|\mathbf{v}\|_1}{1/N}
= \log N + \sum_{i=1}^{N} \frac{v_i}{\|\mathbf{v}\|_1} \log \frac{v_i}{\|\mathbf{v}\|_1} \leq \log N .
\]

Additionally, since the relative entropy is always non-negative, we have $\Phi_{T+1} \geq 0$.
This yields the following lower bound:

\[
\Phi_{T+1} - \Phi_1 \geq 0 - \log N = -\log N .
\]

Combining the upper and lower bounds we see that $-\log N \leq M (\eta^2 r_\infty^2 / 2 - \eta \rho_\infty)$.
Setting $\eta = \frac{\rho_\infty}{r_\infty^2}$ yields the statement of the theorem. ∎
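
For completeness, the final substitution can be written out. The worked step below only uses quantities already defined in the proof; the remark that this value of $\eta$ minimizes the quadratic $\eta^2 r_\infty^2/2 - \eta \rho_\infty$ is an added observation.

```latex
\begin{align*}
\Big( \frac{\eta^2 r_\infty^2}{2} - \eta \rho_\infty \Big) \Big|_{\eta = \rho_\infty / r_\infty^2}
  &= \frac{\rho_\infty^2}{2 r_\infty^2} - \frac{\rho_\infty^2}{r_\infty^2}
   = -\frac{\rho_\infty^2}{2 r_\infty^2}, \\
-\log N \leq -M \, \frac{\rho_\infty^2}{2 r_\infty^2}
  &\iff M \leq 2 \, \frac{r_\infty^2}{\rho_\infty^2} \log N .
\end{align*}
```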




The margin-based mistake bounds of theorem 7.8 and theorem 7.12 for the Perceptron
and Winnow algorithms have a similar form, but they are based on different
norms. For both algorithms, the norm $\| \cdot \|_p$ used for the input vectors $\mathbf{x}_t$, $t \in [1, T]$,
is the dual of the norm $\| \cdot \|_q$ used for the margin vector $\mathbf{v}$, that is $p$ and $q$ are
conjugate: $1/p + 1/q = 1$. In the case of the Perceptron algorithm $p = q = 2$, while
for Winnow $p = \infty$ and $q = 1$.

These bounds imply different types of guarantees. The bound for Winnow is
favorable when a sparse set of the experts $i \in [1, N]$ can predict well. For example,
if $\mathbf{v} = \mathbf{e}_1$ where $\mathbf{e}_1$ is the unit vector along the first axis in $\mathbb{R}^N$ and if $\mathbf{x}_t \in \{-1, +1\}^N$
for all $t$, then the upper bound on the number of mistakes given for Winnow by
theorem 7.12 is only $2 \log N$, while the upper bound of theorem 7.8 for the Perceptron
algorithm is $N$. The guarantee for the Perceptron algorithm is more favorable in
the opposite situation, where sparse solutions are not effective.
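
The two quantities in this example can be computed explicitly. The following worked step assumes, as the example implicitly does, that the labels are given by the first coordinate, $y_t = x_{t,1}$, so that $\mathbf{e}_1$ separates the sequence.

```latex
\begin{align*}
\text{Winnow (theorem 7.12):}\quad
  & r_\infty = \max_t \|\mathbf{x}_t\|_\infty = 1, \quad
    \rho_\infty = \min_t \frac{y_t (\mathbf{e}_1 \cdot \mathbf{x}_t)}{\|\mathbf{e}_1\|_1} = 1, \quad
    2 \, \frac{r_\infty^2}{\rho_\infty^2} \log N = 2 \log N, \\
\text{Perceptron (theorem 7.8):}\quad
  & r = \max_t \|\mathbf{x}_t\|_2 = \sqrt{N}, \quad
    \rho = \min_t \frac{y_t (\mathbf{e}_1 \cdot \mathbf{x}_t)}{\|\mathbf{e}_1\|_2} = 1, \quad
    \frac{r^2}{\rho^2} = N .
\end{align*}
```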


7.4 On-line to batch conversion 


The previous sections presented several algorithms for the scenario of on-line 
learning, including the Perceptron and Winnow algorithms, and analyzed their 
behavior within the mistake model, where no assumption is made about the way the 
training sequence is generated. Can these algorithms be used to derive hypotheses 
with small generalization error in the standard stochastic setting? How can the 
intermediate hypotheses they generate be combined to form an accurate predictor? 
These are the questions addressed in this section. 

Let $H$ be a hypothesis set of functions mapping $\mathcal{X}$ to $\mathcal{Y}'$, and let $L \colon \mathcal{Y}' \times \mathcal{Y} \to \mathbb{R}_+$
be a bounded loss function, that is $L \leq M$ for some $M \geq 0$. We assume a standard
supervised learning setting where a labeled sample $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in
(\mathcal{X} \times \mathcal{Y})^T$ is drawn i.i.d. according to some fixed but unknown distribution $D$. The
sample is sequentially processed by an on-line learning algorithm $\mathcal{A}$. The algorithm
starts with an initial hypothesis $h_1 \in H$ and generates a new hypothesis $h_{i+1} \in H$
after processing pair $(x_i, y_i)$, $i \in [1, T]$. The regret of the algorithm is defined as
before by

\[
R_T = \sum_{i=1}^{T} L(h_i(x_i), y_i) - \min_{h \in H} \sum_{i=1}^{T} L(h(x_i), y_i). \tag{7.24}
\]

The generalization error of a hypothesis $h \in H$ is its expected loss $R(h) =
\mathbb{E}_{(x,y) \sim D}[L(h(x), y)]$.

The following lemma gives a bound on the average of the generalization errors of
the hypotheses generated by $\mathcal{A}$ in terms of its average loss $\frac{1}{T} \sum_{i=1}^{T} L(h_i(x_i), y_i)$.


Lemma 7.1 

Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathcal{X} \times \mathcal{Y})^T$ be a labeled sample drawn i.i.d. according
to $D$, $L$ a loss bounded by $M$ and $h_1, \ldots, h_{T+1}$ the sequence of hypotheses generated
by an on-line algorithm $\mathcal{A}$ sequentially processing $S$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, the following holds:

\[
\frac{1}{T} \sum_{i=1}^{T} R(h_i) \leq \frac{1}{T} \sum_{i=1}^{T} L(h_i(x_i), y_i) + M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}} . \tag{7.25}
\]


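Proof For any $i \in [1, T]$, let $V_i$ be the random variable defined by $V_i = R(h_i) -
L(h_i(x_i), y_i)$. Observe that for any $i \in [1, T]$,

\[
\mathbb{E}[V_i \mid (x_1, y_1), \ldots, (x_{i-1}, y_{i-1})] = R(h_i) - \mathbb{E}[L(h_i(x_i), y_i) \mid h_i] = R(h_i) - R(h_i) = 0 .
\]

Since the loss is bounded by $M$, $V_i$ takes values in the interval $[-M, +M]$ for
all $i \in [1, T]$. Thus, by Azuma's inequality (theorem D.2), $\Pr[\frac{1}{T} \sum_{i=1}^{T} V_i \geq \epsilon] \leq
\exp(-2T\epsilon^2/(2M)^2)$. Setting the right-hand side to be equal to $\delta > 0$ yields the
statement of the lemma. ∎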
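
The last step amounts to solving for $\epsilon$. The short computation below spells this out; it uses only the quantities appearing in Azuma's inequality as applied above.

```latex
\begin{align*}
\exp\!\Big( -\frac{2 T \epsilon^2}{(2M)^2} \Big) = \delta
\iff \frac{T \epsilon^2}{2 M^2} = \log \frac{1}{\delta}
\iff \epsilon = M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}} .
\end{align*}
```

With this value of $\epsilon$, the event $\frac{1}{T} \sum_{i=1}^{T} V_i \leq \epsilon$ has probability at least $1 - \delta$, and it is exactly inequality (7.25).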


When the loss function is convex with respect to its first argument, the lemma
can be used to derive a bound on the generalization error of the average of the
hypotheses generated by $\mathcal{A}$, $\frac{1}{T} \sum_{i=1}^{T} h_i$, in terms of the average loss of $\mathcal{A}$ on $S$, or
in terms of the regret $R_T$ and the infimum error of hypotheses in $H$.


Theorem 7.13 

Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (\mathcal{X} \times \mathcal{Y})^T$ be a labeled sample drawn i.i.d.
according to $D$, $L$ a loss bounded by $M$ and convex with respect to its first argument,
and $h_1, \ldots, h_{T+1}$ the sequence of hypotheses generated by an on-line algorithm $\mathcal{A}$
sequentially processing $S$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, each
of the following holds:

\[
R\Big( \frac{1}{T} \sum_{i=1}^{T} h_i \Big) \leq \frac{1}{T} \sum_{i=1}^{T} L(h_i(x_i), y_i) + M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}} \tag{7.26}
\]

\[
R\Big( \frac{1}{T} \sum_{i=1}^{T} h_i \Big) \leq \inf_{h \in H} R(h) + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} . \tag{7.27}
\]


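Proof By the convexity of $L$ with respect to its first argument, for any $(x, y) \in
\mathcal{X} \times \mathcal{Y}$, we have $L(\frac{1}{T} \sum_{i=1}^{T} h_i(x), y) \leq \frac{1}{T} \sum_{i=1}^{T} L(h_i(x), y)$. Taking the expectation
gives $R(\frac{1}{T} \sum_{i=1}^{T} h_i) \leq \frac{1}{T} \sum_{i=1}^{T} R(h_i)$. The first inequality then follows by lemma 7.1.
Thus, by definition of the regret $R_T$, for any $\delta > 0$, the following holds with
probability at least $1 - \delta/2$:

\[
\begin{aligned}
R\Big( \frac{1}{T} \sum_{i=1}^{T} h_i \Big)
&\leq \frac{1}{T} \sum_{i=1}^{T} L(h_i(x_i), y_i) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \\
&\leq \min_{h \in H} \frac{1}{T} \sum_{i=1}^{T} L(h(x_i), y_i) + \frac{R_T}{T} + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} .
\end{aligned}
\]

By definition of $\inf_{h \in H} R(h)$, for any $\epsilon > 0$, there exists $h^* \in H$ with $R(h^*) \leq
\inf_{h \in H} R(h) + \epsilon$. By Hoeffding's inequality, for any $\delta > 0$, with probability at least
$1 - \delta/2$, $\frac{1}{T} \sum_{i=1}^{T} L(h^*(x_i), y_i) \leq R(h^*) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}$. Thus, for any $\epsilon > 0$, by the
union bound, the following holds with probability at least $1 - \delta$:

\[
\begin{aligned}
R\Big( \frac{1}{T} \sum_{i=1}^{T} h_i \Big)
&\leq \frac{1}{T} \sum_{i=1}^{T} L(h^*(x_i), y_i) + \frac{R_T}{T} + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \\
&\leq R(h^*) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} + \frac{R_T}{T} + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \\
&= R(h^*) + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \\
&\leq \inf_{h \in H} R(h) + \epsilon + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} .
\end{aligned}
\]

Since this inequality holds for all $\epsilon > 0$, it implies the second statement of the
theorem. ∎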


The theorem can be applied to a variety of on-line regret minimization algorithms,
for example when $R_T/T = O(1/\sqrt{T})$. In particular, we can apply the theorem to
the exponential weighted average algorithm. Assuming that the loss $L$ is bounded
by $M = 1$ and that the number of rounds $T$ is known to the algorithm, we can use
the regret bound of theorem 7.6. The doubling trick (used in theorem 7.7) can be
used to derive a similar bound if $T$ is not known in advance. Thus, for any $\delta > 0$,
with probability at least $1 - \delta$, the following holds for the generalization error of
the average of the hypotheses generated by exponential weighted average:

\[
R\Big( \frac{1}{T} \sum_{i=1}^{T} h_i \Big) \leq \inf_{h \in H} R(h) + \sqrt{\frac{\log N}{2T}} + 2 \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} ,
\]

where $N$ is the number of experts, or the dimension of the weight vectors.
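
To get a feel for the rate, the two additive terms of this bound can be evaluated numerically. The Python sketch below is illustrative only: the function name `ewa_excess_risk_bound` is hypothetical, and it assumes, as in the text, a loss bounded by $M = 1$ and a regret of at most $\sqrt{(T/2) \log N}$.

```python
import math


def ewa_excess_risk_bound(T, N, delta):
    """Sum of the regret term and the confidence term, for losses in [0, 1]."""
    regret_term = math.sqrt(math.log(N) / (2 * T))  # upper bound on R_T / T
    confidence_term = 2 * math.sqrt(2 * math.log(2 / delta) / T)
    return regret_term + confidence_term


# Usage: both terms decrease as 1 / sqrt(T).
for rounds in (100, 10_000, 1_000_000):
    print(rounds, ewa_excess_risk_bound(rounds, N=10, delta=0.05))
```

Multiplying the number of rounds by 100 divides the bound by 10, while the dependence on the number of experts is only logarithmic.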


7.5  Game-theoretic connection 


The existence of regret minimization algorithms can be used to give a simple proof
of von Neumann's theorem. For any $m \geq 1$, we will denote by $\Delta_m$ the set of all
distributions over $\{1, \ldots, m\}$, that is $\Delta_m = \{\mathbf{p} \in \mathbb{R}^m \colon \mathbf{p} \geq 0 \wedge \|\mathbf{p}\|_1 = 1\}$.


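Theorem 7.14 Von Neumann's minimax theorem
Let $m, n \geq 1$. Then, for any two-person zero-sum game defined by matrix $\mathbf{M} \in
\mathbb{R}^{m \times n}$,

\[
\min_{\mathbf{p} \in \Delta_m} \max_{\mathbf{q} \in \Delta_n} \mathbf{p}^\top \mathbf{M} \mathbf{q} = \max_{\mathbf{q} \in \Delta_n} \min_{\mathbf{p} \in \Delta_m} \mathbf{p}^\top \mathbf{M} \mathbf{q} . \tag{7.28}
\]

Proof The inequality $\max_{\mathbf{q}} \min_{\mathbf{p}} \mathbf{p}^\top \mathbf{M} \mathbf{q} \leq \min_{\mathbf{p}} \max_{\mathbf{q}} \mathbf{p}^\top \mathbf{M} \mathbf{q}$ is straightforward,
since by definition of min, for all $\mathbf{p} \in \Delta_m$, $\mathbf{q} \in \Delta_n$, we have $\min_{\mathbf{p}} \mathbf{p}^\top \mathbf{M} \mathbf{q} \leq \mathbf{p}^\top \mathbf{M} \mathbf{q}$.
Taking the maximum over $\mathbf{q}$ of both sides gives $\max_{\mathbf{q}} \min_{\mathbf{p}} \mathbf{p}^\top \mathbf{M} \mathbf{q} \leq \max_{\mathbf{q}} \mathbf{p}^\top \mathbf{M} \mathbf{q}$
for all $\mathbf{p}$; subsequently taking the minimum over $\mathbf{p}$ proves the inequality.²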

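To show the reverse inequality, consider an on-line learning setting where at each
round $t \in [1, T]$, algorithm $\mathcal{A}$ returns $\mathbf{p}_t$ and incurs loss $\mathbf{M} \mathbf{q}_t$. We can assume that
$\mathbf{q}_t$ is selected in the optimal adversarial way, that is $\mathbf{q}_t \in \operatorname{argmax}_{\mathbf{q} \in \Delta_n} \mathbf{p}_t^\top \mathbf{M} \mathbf{q}$,
and that $\mathcal{A}$ is a regret minimization algorithm, that is $R_T/T \to 0$, where $R_T =
\sum_{t=1}^{T} \mathbf{p}_t^\top \mathbf{M} \mathbf{q}_t - \min_{\mathbf{p} \in \Delta_m} \sum_{t=1}^{T} \mathbf{p}^\top \mathbf{M} \mathbf{q}_t$. Then, the following holds:

\[
\min_{\mathbf{p} \in \Delta_m} \max_{\mathbf{q} \in \Delta_n} \mathbf{p}^\top \mathbf{M} \mathbf{q}
\leq \max_{\mathbf{q}} \Big( \frac{1}{T} \sum_{t=1}^{T} \mathbf{p}_t \Big)^{\!\top} \mathbf{M} \mathbf{q}
\leq \frac{1}{T} \sum_{t=1}^{T} \max_{\mathbf{q}} \mathbf{p}_t^\top \mathbf{M} \mathbf{q}
= \frac{1}{T} \sum_{t=1}^{T} \mathbf{p}_t^\top \mathbf{M} \mathbf{q}_t .
\]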


2. More generally, the maxmin is always upper bounded by the minmax for any function
of two arguments and any constraint sets, following the same proof.
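By definition of regret, the right-hand side can be expressed and bounded as follows:

\[
\begin{aligned}
\frac{1}{T} \sum_{t=1}^{T} \mathbf{p}_t^\top \mathbf{M} \mathbf{q}_t
&= \min_{\mathbf{p} \in \Delta_m} \frac{1}{T} \sum_{t=1}^{T} \mathbf{p}^\top \mathbf{M} \mathbf{q}_t + \frac{R_T}{T}
= \min_{\mathbf{p} \in \Delta_m} \mathbf{p}^\top \mathbf{M} \Big( \frac{1}{T} \sum_{t=1}^{T} \mathbf{q}_t \Big) + \frac{R_T}{T} \\
&\leq \max_{\mathbf{q} \in \Delta_n} \min_{\mathbf{p} \in \Delta_m} \mathbf{p}^\top \mathbf{M} \mathbf{q} + \frac{R_T}{T} .
\end{aligned}
\]

This implies that the following bound holds for the minmax for all $T \geq 1$:

\[
\min_{\mathbf{p} \in \Delta_m} \max_{\mathbf{q} \in \Delta_n} \mathbf{p}^\top \mathbf{M} \mathbf{q} \leq \max_{\mathbf{q} \in \Delta_n} \min_{\mathbf{p} \in \Delta_m} \mathbf{p}^\top \mathbf{M} \mathbf{q} + \frac{R_T}{T} .
\]

Since $\lim_{T \to +\infty} \frac{R_T}{T} = 0$, this shows that $\min_{\mathbf{p}} \max_{\mathbf{q}} \mathbf{p}^\top \mathbf{M} \mathbf{q} \leq \max_{\mathbf{q}} \min_{\mathbf{p}} \mathbf{p}^\top \mathbf{M} \mathbf{q}$. ∎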
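
The argument of the proof is constructive, and can be sketched in code. The following Python fragment is an illustration under stated assumptions, not an algorithm given in the text: the function name `solve_zero_sum` is hypothetical, the entries of the matrix are assumed to lie in $[0, 1]$, the row player uses exponentially weighted averages with the learning rate $\sqrt{8 \log m / T}$, and the column player best-responds with a pure strategy at each round.

```python
import math


def solve_zero_sum(M, T=2000):
    """Approximate min_p max_q p^T M q; assumes the entries of M lie in [0, 1].

    Returns the average loss of the row player together with the averaged
    row and column strategies.
    """
    m, n = len(M), len(M[0])
    eta = math.sqrt(8.0 * math.log(m) / T)
    p = [1.0 / m] * m
    p_sum = [0.0] * m
    q_counts = [0] * n
    total_loss = 0.0
    for _ in range(T):
        # Best response of the column player: a pure strategy maximizing p^T M q.
        payoffs = [sum(p[i] * M[i][j] for i in range(m)) for j in range(n)]
        j_star = max(range(n), key=lambda j: payoffs[j])
        total_loss += payoffs[j_star]
        q_counts[j_star] += 1
        p_sum = [a + b for a, b in zip(p_sum, p)]
        # Exponentially weighted update of the row strategy on the loss M q_t.
        unnormalized = [p[i] * math.exp(-eta * M[i][j_star]) for i in range(m)]
        z = sum(unnormalized)
        p = [u / z for u in unnormalized]
    return total_loss / T, [a / T for a in p_sum], [c / T for c in q_counts]


# Usage: matching pennies, whose value is 1/2.
value, p_avg, q_avg = solve_zero_sum([[1.0, 0.0], [0.0, 1.0]])
print(value, p_avg, q_avg)
```

The returned average loss lies between the value of the game and the value plus the average regret, which is the sandwich used in the proof.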


7.6 Chapter notes 


Algorithms for regret minimization were initiated with the pioneering work of
Hannan [1957] who gave an algorithm whose regret grows as $O(\sqrt{T})$ as a function
of $T$ but whose dependency on $N$ is linear. The weighted majority algorithm and
the randomized weighted majority algorithm, whose regret is only logarithmic in $N$,
are due to Littlestone and Warmuth [1989]. The exponential weighted average algorithm
and its analysis, which can be viewed as an extension of the WM algorithm to
convex non-zero-one losses, is due to the same authors [Littlestone and Warmuth,
1989, 1994]. The analysis we presented follows Cesa-Bianchi [1999] and Cesa-Bianchi
and Lugosi [2006]. The doubling trick technique appears in Vovk [1990] and
Cesa-Bianchi et al. [1997]. The algorithm of exercise 7.7 and the analysis leading to a
second-order bound on the regret are due to Cesa-Bianchi et al. [2005]. The lower
bound presented in theorem 7.5 is from Blum and Mansour [2007].

While the regret bounds presented are logarithmic in the number of the experts 
N, when N is exponential in the size of the input problem, the computational 
complexity of an expert algorithm could be exponential. For example, in the on- 
line shortest paths problem, N is the number of paths between two vertices of 
a directed graph. However, several computationally efficient algorithms have been 
presented for broad classes of such problems by exploiting their structure [Takimoto 
and Warmuth, 2002, Kalai and Vempala, 2003, Zinkevich, 2003]. 

The notion of regret (or external regret) presented in this chapter can be
generalized to that of internal regret or even swap regret, by comparing the loss of the
algorithm not just to that of the best expert in retrospect, but to that of any
modification of the actions taken by the algorithm by replacing each occurrence of some
specific action with another one (internal regret), or even replacing actions via an
arbitrary mapping (swap regret) [Foster and Vohra, 1997, Hart and Mas-Colell, 2000,
Lehrer, 2003]. Several algorithms for low internal regret have been given [Foster
and Vohra, 1997, 1998, 1999, Hart and Mas-Colell, 2000, Cesa-Bianchi and Lugosi,
2001, Stoltz and Lugosi, 2003], including a conversion of low external regret to low
swap regret by Blum and Mansour [2005].

The Perceptron algorithm was introduced by Rosenblatt [1958]. The algorithm 
raised a number of reactions, in particular by Minsky and Papert [1969], who 
objected that the algorithm could not be used to recognize the XOR function. 
Of course, the kernel Perceptron algorithm already given by Aizerman et al. [1964]
could straightforwardly do so using second-degree polynomial kernels.
The margin bound for the Perceptron algorithm was proven by Novikoff [1962] 
and is one of the first results in learning theory. The leave-one-out analysis for 
SVMs is described by Vapnik [1998]. The upper bound presented for the Perceptron 
algorithm in the non-separable case is by Freund and Schapire [1999a]. The Winnow 
algorithm was introduced by Littlestone [1987]. 

The analysis of the on-line to batch conversion and exercise 7.10 are from
Cesa-Bianchi et al. [2001, 2004] (see also Littlestone [1989]). Von Neumann's minimax
theorem admits a number of different generalizations. See Sion [1958] for a
generalization to quasi-concave-convex functions semi-continuous in each argument and
the references therein. The simple proof of von Neumann's theorem presented here
is entirely based on learning-related techniques. A proof of a more general version 
using multiplicative updates was presented by Freund and Schapire [1999b]. 

On-line learning is a very broad and fast-growing research area in machine 
learning. The material presented in this chapter should be viewed only as an 
introduction to the topic, but the proofs and techniques presented should indicate 
the flavor of most results in this area. For a more comprehensive presentation of on- 
line learning and related game theory algorithms and techniques, the reader could 
consult the book of Cesa-Bianchi and Lugosi [2006]. 


7.7 Exercises 


7.1 Perceptron lower bound. Let $S$ be a labeled sample of $m$ points in $\mathbb{R}^N$ with

\[
\mathbf{x}_i = (\underbrace{(-1)^i, \ldots, (-1)^i, (-1)^{i+1}}_{i \text{ first components}}, 0, \ldots, 0) \quad \text{and} \quad y_i = (-1)^{i+1} . \tag{7.29}
\]

Show that the Perceptron algorithm makes $\Omega(2^N)$ updates before finding a
separating hyperplane, regardless of the order in which it receives the points.


7.2 Generalized mistake bound. Theorem 7.8 presents a margin bound on the 
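
ON-LINE-SVM($\mathbf{w}_0$)
    $\mathbf{w}_1 \leftarrow \mathbf{w}_0$    ▷ typically $\mathbf{w}_0 = \mathbf{0}$
    for $t \leftarrow 1$ to $T$ do
        RECEIVE($\mathbf{x}_t$, $y_t$)
        if $y_t (\mathbf{w}_t \cdot \mathbf{x}_t) < 1$ then
            $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta (\mathbf{w}_t - C y_t \mathbf{x}_t)$
        elseif $y_t (\mathbf{w}_t \cdot \mathbf{x}_t) > 1$ then
            $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta \mathbf{w}_t$
        else $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$
    return $\mathbf{w}_{T+1}$


Figure 7.11 On-line SVM algorithm.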


maximum number of updates for the Perceptron algorithm for the special case
$\eta = 1$. Consider now the general Perceptron update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta y_t \mathbf{x}_t$, where
$\eta > 0$. Prove a bound on the maximum number of mistakes. How does $\eta$ affect the
bound?


7.3 Sparse instances. Suppose each input vector $\mathbf{x}_t$, $t \in [1, T]$, coincides with the
$t$th unit vector of $\mathbb{R}^T$. How many updates are required for the Perceptron algorithm
to converge? Show that the number of updates matches the margin bound of
theorem 7.8.


7.4 Tightness of lower bound. Is the lower bound of theorem 7.5 tight? Explain why 
or show a counter-example. 


7.5 On-line SVM algorithm. Consider the algorithm described in figure 7.11. Show 
that this algorithm corresponds to the stochastic gradient descent technique applied 
to the SVM problem (4.23) with hinge loss and no offset (i.e., fix p= 1 and b= 0). 
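
For readers who wish to experiment before attempting the exercise, the update rule of figure 7.11 can be transcribed directly. The Python sketch below is only a transcription of the figure under assumptions: the function name `online_svm` and its argument names are hypothetical, and it does not by itself establish the correspondence with stochastic gradient descent that the exercise asks for.

```python
def online_svm(points, labels, eta, C, w0=None):
    """One pass of the update rule of figure 7.11; returns the final weights."""
    n = len(points[0])
    w = list(w0) if w0 is not None else [0.0] * n
    for x, y in zip(points, labels):
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1:
            # Margin violated: shrink w and move it towards C * y * x.
            w = [wi - eta * (wi - C * y * xi) for wi, xi in zip(w, x)]
        elif margin > 1:
            # Margin satisfied with slack: only shrink w.
            w = [wi - eta * wi for wi in w]
        # margin == 1: w is left unchanged, as in the figure.
    return w


# Usage on a single labeled point.
print(online_svm([[1.0, 0.0]], [1], eta=0.1, C=1.0))
```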


7.6 Margin Perceptron. Given a training sample $S$ that is linearly separable with
a maximum margin $\rho > 0$, theorem 7.8 states that the Perceptron algorithm run
cyclically over $S$ is guaranteed to converge after at most $R^2/\rho^2$ updates, where $R$ is
the radius of the sphere containing the sample points. However, this theorem does
not guarantee that the hyperplane solution of the Perceptron algorithm achieves
a margin close to $\rho$. Suppose we modify the Perceptron algorithm to ensure that
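
MARGINPERCEPTRON()
    $\mathbf{w}_1 \leftarrow \mathbf{0}$
    for $t \leftarrow 1$ to $T$ do
        RECEIVE($\mathbf{x}_t$)
        RECEIVE($y_t$)
        if ($(\mathbf{w}_t = \mathbf{0})$ or $\big(\frac{y_t \mathbf{w}_t \cdot \mathbf{x}_t}{\|\mathbf{w}_t\|} < \frac{\rho}{2}\big)$) then
            $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \mathbf{x}_t$
        else $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$
    return $\mathbf{w}_{T+1}$


Figure 7.12 Margin Perceptron algorithm.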


the margin of the hyperplane solution is at least $\rho/2$. In particular, consider the
algorithm described in figure 7.12. In this problem we show that this algorithm
converges after at most $16 R^2/\rho^2$ updates. Let $I$ denote the set of times $t \in [1, T]$
at which the algorithm makes an update and let $M = |I|$ be the total number of
updates.


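(a) Using an analysis similar to the one given for the Perceptron algorithm,
show that $M \rho \leq \|\mathbf{w}_{T+1}\|$. Conclude that if $\|\mathbf{w}_{T+1}\| < \frac{4R^2}{\rho}$, then $M < 4R^2/\rho^2$.

(For the remainder of this problem, we will assume that $\|\mathbf{w}_{T+1}\| \geq \frac{4R^2}{\rho}$.)

(b) Show that for any $t \in I$ (including $t = 0$), the following holds:

\[
\|\mathbf{w}_{t+1}\|^2 \leq (\|\mathbf{w}_t\| + \rho/2)^2 + R^2 .
\]

(c) From (b), infer that for any $t \in I$ we have

\[
\|\mathbf{w}_{t+1}\| \leq \|\mathbf{w}_t\| + \rho/2 + \frac{R^2}{\|\mathbf{w}_t\| + \|\mathbf{w}_{t+1}\| + \rho/2} .
\]

(d) Using the inequality from (c), show that for any $t \in I$ such that either
$\|\mathbf{w}_t\| \geq \frac{4R^2}{\rho}$ or $\|\mathbf{w}_{t+1}\| \geq \frac{4R^2}{\rho}$, we have

\[
\|\mathbf{w}_{t+1}\| \leq \|\mathbf{w}_t\| + \frac{3}{4} \rho .
\]

(e) Show that $\|\mathbf{w}_1\| \leq R \leq 4R^2/\rho$. Since by assumption we have $\|\mathbf{w}_{T+1}\| \geq
\frac{4R^2}{\rho}$, conclude that there must exist a largest time $t_0 \in I$ such that $\|\mathbf{w}_{t_0}\| \leq \frac{4R^2}{\rho}$
and $\|\mathbf{w}_{t_0+1}\| \geq \frac{4R^2}{\rho}$.

(f) Show that $\|\mathbf{w}_{T+1}\| \leq \|\mathbf{w}_{t_0}\| + \frac{3}{4} M \rho$. Conclude that $M \leq 16 R^2/\rho^2$.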


7.7 Second-order regret bound. Consider the randomized algorithm that differs from
the RWM algorithm only by the weight update, i.e., $w_{t+1,i} \leftarrow (1 - (1 - \beta) l_{t,i}) w_{t,i}$,
$t \in [1, T]$, which is applied to all $i \in [1, N]$ with $1/2 \leq \beta < 1$. This algorithm can
be used in a more general setting than RWM since the losses $l_{t,i}$ are only assumed
to be in $[0, 1]$. The objective of this problem is to show that a similar upper bound
can be shown for the regret.

(a) Use the same potential $W_t$ as for the RWM algorithm and derive a simple
upper bound for $\log W_{T+1}$:

\[
\log W_{T+1} \leq \log N - (1 - \beta) L_T .
\]

(Hint: Use the inequality $\log(1 - x) \leq -x$ for $x \in [0, 1/2]$.)

(b) Prove the following lower bound for the potential for all $i \in [1, N]$:

\[
\log W_{T+1} \geq -(1 - \beta) L_{T,i} - (1 - \beta)^2 \sum_{t=1}^{T} l_{t,i}^2 .
\]

(Hint: Use the inequality $\log(1 - x) \geq -x - x^2$, which is valid for all $x \in [0, 1/2]$.)

(c) Use upper and lower bounds to derive the following regret bound for the
algorithm: $R_T \leq 2 \sqrt{T \log N}$.


7.8 Polynomial weighted algorithm. The objective of this problem is to show how
another regret minimization algorithm can be defined and studied. Let $L$ be a loss
function convex in its first argument and taking values in $[0, M]$.

We will assume $N > e^2$ and then for any expert $i \in [1, N]$, we denote by $r_{t,i}$ the
instantaneous regret of that expert at time $t \in [1, T]$, $r_{t,i} = L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)$, and
by $R_{t,i}$ its cumulative regret up to time $t$: $R_{t,i} = \sum_{s=1}^{t} r_{s,i}$. For convenience, we also
define $R_{0,i} = 0$ for all $i \in [1, N]$. For any $x \in \mathbb{R}$, $(x)_+$ denotes $\max(x, 0)$, that is the
positive part of $x$, and for $\mathbf{x} = (x_1, \ldots, x_N)^\top \in \mathbb{R}^N$, $(\mathbf{x})_+ = ((x_1)_+, \ldots, (x_N)_+)^\top$.

Let $\alpha > 2$ and consider the algorithm that predicts at round $t \in [1, T]$ according
to $\hat{y}_t = \frac{\sum_{i=1}^{N} w_{t,i} y_{t,i}}{\sum_{i=1}^{N} w_{t,i}}$, with the weight $w_{t,i}$ defined based on the $\alpha$th power of
the regret up to time $(t - 1)$: $w_{t,i} = (R_{t-1,i})_+^{\alpha - 1}$. The potential function we
use to analyze the algorithm is based on the function $\Phi$ defined over $\mathbb{R}^N$ by
$\Phi \colon \mathbf{x} \mapsto \|(\mathbf{x})_+\|_\alpha^2 = \big[ \sum_{i=1}^{N} (x_i)_+^\alpha \big]^{2/\alpha}$.
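(a) Show that $\Phi$ is twice differentiable over $\mathbb{R}^N - B$, where $B$ is defined as
follows:

\[
B = \{\mathbf{u} \in \mathbb{R}^N \colon (\mathbf{u})_+ = 0\} .
\]

(b) For any $t \in [1, T]$, let $\mathbf{r}_t$ denote the vector of instantaneous regrets,
$\mathbf{r}_t = (r_{t,1}, \ldots, r_{t,N})^\top$, and similarly $\mathbf{R}_t = (R_{t,1}, \ldots, R_{t,N})^\top$. We define the
potential function as $\Phi(\mathbf{R}_t) = \|(\mathbf{R}_t)_+\|_\alpha^2$. Compute $\nabla \Phi(\mathbf{R}_{t-1})$ for $\mathbf{R}_{t-1} \notin B$
and show that $\nabla \Phi(\mathbf{R}_{t-1}) \cdot \mathbf{r}_t \leq 0$ (Hint: use the convexity of the loss with
respect to the first argument).

(c) Prove the inequality $\mathbf{r}^\top [\nabla^2 \Phi(\mathbf{u})] \mathbf{r} \leq 2(\alpha - 1) \|\mathbf{r}\|_\alpha^2$, valid for all $\mathbf{r} \in \mathbb{R}^N$ and
$\mathbf{u} \in \mathbb{R}^N - B$ (Hint: write the Hessian $\nabla^2 \Phi(\mathbf{u})$ as a sum of a diagonal matrix
and a positive semi-definite matrix multiplied by $(2 - \alpha)$. Also, use Hölder's
inequality generalizing Cauchy-Schwarz: for any $p > 1$ and $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$
and $\mathbf{u}, \mathbf{v} \in \mathbb{R}^N$, $|\mathbf{u} \cdot \mathbf{v}| \leq \|\mathbf{u}\|_p \|\mathbf{v}\|_q$).

(d) Using the answers to the two previous questions and Taylor's formula, show
that for all $t \geq 1$, $\Phi(\mathbf{R}_t) - \Phi(\mathbf{R}_{t-1}) \leq (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2$, if $\gamma \mathbf{R}_{t-1} + (1 - \gamma) \mathbf{R}_t \notin B$
for all $\gamma \in [0, 1]$.

(e) Suppose there exists $\gamma \in [0, 1]$ such that $(1 - \gamma) \mathbf{R}_{t-1} + \gamma \mathbf{R}_t \in B$. Show that
$\Phi(\mathbf{R}_t) \leq (\alpha - 1) \|\mathbf{r}_t\|_\alpha^2$.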

(f) Using the two previous questions, derive an upper bound on $\Phi(\mathbf{R}_T)$
expressed in terms of $T$, $N$, and $M$.

(g) Show that $\Phi(\mathbf{R}_T)$ admits as a lower bound the square of the regret $R_T$ of
the algorithm.

(h) Using the two previous questions give an upper bound on the regret $R_T$.
For what value of $\alpha$ is the bound the most favorable? Give a simple expression
of the upper bound on the regret for a suitable approximation of that optimal
value.


7.9 General inequality. In this exercise we generalize the result of exercise 7.7 by
using a more general inequality: $\log(1 - x) \geq -x - \frac{x^2}{\alpha}$ for some $0 < \alpha < 2$.

(a) First prove that the inequality is true for $x \in [0, 1 - \frac{\alpha}{2}]$. What does this
imply about the valid range of $\beta$?

(b) Give a generalized version of the regret bound derived in exercise 7.7 in
terms of $\alpha$, which shows:

\[
R_T \leq \frac{\log N}{1 - \beta} + \frac{1 - \beta}{\alpha} T .
\]

What is the optimal choice of $\beta$ and the resulting bound in this case?

(c) Explain how $\alpha$ may act as a regularization parameter. What is the optimal
choice of $\alpha$?


7.10 On-line to batch. Consider the margin loss (4.3), which is convex. Our goal is 
to apply theorem 7.13 to the kernel Perceptron algorithm using the margin loss. 


(a) Show that the regret $R_T$ can be bounded as $R_T \leq \sqrt{\mathrm{Tr}[\mathbf{K}]/\rho^2}$ where $\rho$ is
the margin and $\mathbf{K}$ is the kernel matrix associated to the sequence $x_1, \ldots, x_T$.


(b) Apply theorem 7.13. How does this result compare with the margin bounds 
for kernel-based hypotheses given by corollary 5.1? 


7.11 On-line to batch: non-convex loss. The on-line to batch result of theorem 7.13
heavily relies on the fact that the loss is convex in order to provide a generalization
guarantee for the uniformly averaged hypothesis $\frac{1}{T} \sum_{i=1}^{T} h_i$. For general losses,
instead of using the averaged hypothesis we will use a different strategy and try
to estimate the best single base hypothesis and show the expected loss of this
hypothesis is bounded.


Let $m_i$ denote the number of errors hypothesis $h_i$ makes on the points
$(x_i, \ldots, x_T)$, i.e., the subset of points in the sequence that are not used to train
$h_i$. Then we define the penalized risk estimate of hypothesis $h_i$ as

\[
\frac{m_i}{T - i + 1} + c_\delta(T - i + 1) \quad \text{where} \quad c_\delta(x) = \sqrt{\frac{1}{2x} \log \frac{T(T+1)}{\delta}} .
\]

The term $c_\delta$ penalizes the empirical error when the test sample is small. Define
$\hat{h} = h_{i^*}$ where $i^* = \operatorname{argmin}_i \, m_i/(T - i + 1) + c_\delta(T - i + 1)$. We will then show under the
same conditions of theorem 7.13 (with $M = 1$ for simplicity), but without requiring
the convexity of $L$, that the following holds with probability at least $1 - \delta$:

\[
R(\hat{h}) \leq \frac{1}{T} \sum_{i=1}^{T} L(h_i(x_i), y_i) + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}} . \tag{7.30}
\]
(a) Prove the following inequality:

\[
\min_{i \in [1, T]} \big( R(h_i) + 2 c_\delta(T - i + 1) \big) \leq \frac{1}{T} \sum_{i=1}^{T} R(h_i) + 4 \sqrt{\frac{1}{T} \log \frac{T+1}{\delta}} .
\]

(b) Use part (a) to show that with probability at least $1 - \delta$,

\[
\min_{i \in [1, T]} \big( R(h_i) + 2 c_\delta(T - i + 1) \big)
< \frac{1}{T} \sum_{i=1}^{T} L(h_i(x_i), y_i) + \sqrt{\frac{2}{T} \log \frac{1}{\delta}} + 4 \sqrt{\frac{1}{T} \log \frac{T+1}{\delta}} .
\]

(c) By design, the definition of $c_\delta$ ensures that with probability at least $1 - \delta$

\[
R(\hat{h}) \leq \min_{i \in [1, T]} \big( R(h_i) + 2 c_\delta(T - i + 1) \big) .
\]

Use this property to complete the proof of (7.30).


8 Multi-Class Classification 


The classification problems we examined in the previous chapters were all binary. 
However, in most real-world classification problems the number of classes is greater 
than two. The problem may consist of assigning a topic to a text document, a 
category to a speech utterance or a function to a biological sequence. In all of these 
tasks, the number of classes may be on the order of several hundred or more. 

In this chapter, we analyze the problem of multi-class classification. We first
introduce the multi-class classification learning problem and discuss its multiple
settings, and then derive generalization bounds for it using the notion of Rademacher
complexity. Next, we describe and analyze a series of algorithms for tackling the
multi-class classification problem. We will distinguish between two broad classes
of algorithms: uncombined algorithms that are specifically designed for the
multi-class setting such as multi-class SVMs, decision trees, or multi-class boosting, and
aggregated algorithms that are based on a reduction to binary classification and
require training multiple binary classifiers. We will also briefly discuss the problem of
structured prediction, which is a related problem arising in a variety of applications.


8.1 Multi-class classification problem


Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ denote the output space, and let $D$ be an
unknown distribution over $\mathcal{X}$ according to which input points are drawn. We will
distinguish between two cases: the mono-label case, where $\mathcal{Y}$ is a finite set of classes
that we mark with numbers for convenience, $\mathcal{Y} = \{1, \ldots, k\}$, and the multi-label
case where $\mathcal{Y} = \{-1, +1\}^k$. In the mono-label case, each example is labeled with a
single class, while in the multi-label case it can be labeled with several. The latter
can be illustrated by the case of text documents, which can be labeled with several
different relevant topics, e.g., sports, business, and society. The positive components
of a vector in $\{-1, +1\}^k$ indicate the classes associated with an example.

In either case, the learner receives a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in
(\mathcal{X} \times \mathcal{Y})^m$ with $x_1, \ldots, x_m$ drawn i.i.d. according to $D$, and $y_i = f(x_i)$ for all
$i \in [1, m]$, where $f\colon \mathcal{X} \to \mathcal{Y}$ is the target labeling function. Thus, we consider a
deterministic scenario, which, as discussed in section 2.4.1, can be straightforwardly
extended to a stochastic one where we have a distribution over $\mathcal{X} \times \mathcal{Y}$.

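Given a hypothesis set $H$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the multi-class
classification problem consists of using the labeled sample $S$ to find a hypothesis $h \in H$
with small generalization error $R(h)$ with respect to the target $f$:

$$R(h) = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big] \qquad \text{mono-label case} \qquad (8.1)$$

$$R(h) = \mathbb{E}_{x \sim D}\Big[\sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\Big] \qquad \text{multi-label case.} \qquad (8.2)$$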


The notion of Hamming distance $d_H$, that is, the number of corresponding components
in two vectors that differ, can be used to give a common formulation for both
errors:

$$R(h) = \mathbb{E}_{x \sim D}\big[d_H(h(x), f(x))\big]. \qquad (8.3)$$

The empirical error of $h \in H$ is denoted by $\widehat{R}(h)$ and defined by

$$\widehat{R}(h) = \frac{1}{m}\sum_{i=1}^{m} d_H(h(x_i), y_i). \qquad (8.4)$$
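To make the common formulation concrete, the following minimal Python sketch computes the empirical error (8.4) in both settings. The function names `hamming_distance` and `empirical_error` are illustrative choices and not part of the text, and a mono-label class is treated here as a one-component vector so that the Hamming distance reduces to the 0/1 loss.

```python
from typing import Sequence, Union

Label = Union[int, Sequence[int]]


def hamming_distance(u: Label, v: Label) -> int:
    """Number of components in which u and v differ.

    A mono-label class index is treated as a one-component vector,
    so the distance reduces to the 0/1 loss in that case.
    """
    if isinstance(u, int) and isinstance(v, int):
        return int(u != v)
    if len(u) != len(v):
        raise ValueError("label vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def empirical_error(predictions: Sequence[Label], labels: Sequence[Label]) -> float:
    """Average Hamming distance between predictions and labels, as in (8.4)."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(hamming_distance(p, y) for p, y in zip(predictions, labels))
    return total / len(labels)


# Mono-label case with k = 3 classes: one mistake out of four points.
mono_error = empirical_error([1, 2, 3, 1], [1, 2, 2, 1])

# Multi-label case with k = 3: label vectors in {-1, +1}^3.
# The two points differ from their labels in 1 and 2 components respectively.
multi_error = empirical_error([(1, -1, 1), (-1, -1, 1)], [(1, 1, 1), (1, -1, -1)])
```

Note that in the multi-label case the error of a single point can exceed 1, since it counts the number of wrong components.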


Several issues, both computational and learning-related, often arise in the multi- 
class setting. Computationally, dealing with a large number of classes can be 
problematic. The number of classes k directly enters the time complexity of the 
algorithms we will present. Even for a relatively small number of classes such as 
k = 100 or k = 1,000, some techniques may become prohibitive to use in practice. 
This dependency is even more critical in the case where k is very large or even
infinite as in the case of some structured prediction problems. 

A learning-related issue that commonly appears in the multi-class setting is the 
existence of unbalanced classes. Some classes may be represented by less than 5 
percent of the labeled sample, while others may dominate a very large fraction 
of the data. When separate binary classifiers are used to define the multi-class 
solution, we may need to train a classifier distinguishing between two classes with 
only a small representation in the training sample. This implies training on a small 
sample, with poor performance guarantees. Alternatively, when a large fraction 
of the training instances belong to one class, it may be tempting to propose a 
hypothesis always returning that class, since its generalization error as defined 
earlier is likely to be relatively low. However, this trivial solution is typically not the 
one intended. Instead, the loss function may need to be reformulated by assigning 
different misclassification weights to each pair of classes. 

Another learning-related issue is the relationship between classes, which can 
be hierarchical. For example, in the case of document classification, the error of
misclassifying a document dealing with world politics as one dealing with real 
estate should naturally be penalized more than the error of labeling a document 
with sports instead of the more specific label baseball. Thus, a more complex and 
more useful multi-class classification formulation would take into consideration the 
hierarchical relationships between classes and define the loss function in accordance 
with this hierarchy. More generally, there may be a graph relationship between 
classes as in the case of the GO ontology in computational biology. The use of 
hierarchical relationships between classes leads to a richer and more complex multi- 
class classification problem. 


8.2 Generalization bounds


In this section, we present margin-based generalization bounds for multi-class
classification in the mono-label case. In the binary setting, classifiers are often
defined based on the sign of a scoring function. In the multi-class setting, a
hypothesis is defined based on a scoring function $h\colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. The label associated
to point $x$ is the one resulting in the largest score $h(x, y)$, which defines the following
mapping from $\mathcal{X}$ to $\mathcal{Y}$:

$$x \mapsto \operatorname*{argmax}_{y \in \mathcal{Y}} h(x, y).$$

This naturally leads to the following definition of the margin $\rho_h(x, y)$ of the function
$h$ at a labeled example $(x, y)$:

$$\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y').$$


Thus, $h$ misclassifies $(x, y)$ iff $\rho_h(x, y) \leq 0$. For any $\rho > 0$, we can define the empirical
margin loss of a hypothesis $h$ for multi-class classification as

$$\widehat{R}_\rho(h) = \frac{1}{m}\sum_{i=1}^{m} \Phi_\rho(\rho_h(x_i, y_i)), \qquad (8.5)$$

where $\Phi_\rho$ is the margin loss function (definition 4.3). Thus, the empirical margin
loss for multi-class classification is upper bounded by the fraction of the training
points misclassified by $h$ or correctly classified but with confidence less than or equal
to $\rho$:

$$\widehat{R}_\rho(h) \leq \frac{1}{m}\sum_{i=1}^{m} 1_{\rho_h(x_i, y_i) \leq \rho}. \qquad (8.6)$$
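The margin and the empirical margin loss can be illustrated with a short Python sketch. It assumes that $\Phi_\rho$ has the usual ramp form (equal to 1 for $u \leq 0$, to 0 for $u \geq \rho$, and linear in between), which is our reading of definition 4.3 and is not restated in this section. Classes are indexed from 0 in the code, and all function names are illustrative.

```python
from typing import Sequence


def margin(scores: Sequence[float], y: int) -> float:
    """rho_h(x, y) = h(x, y) - max over y' != y of h(x, y')."""
    best_other = max(s for j, s in enumerate(scores) if j != y)
    return scores[y] - best_other


def margin_loss(u: float, rho: float) -> float:
    """Assumed ramp form of Phi_rho: 1 if u <= 0, 0 if u >= rho, linear between."""
    return min(1.0, max(0.0, 1.0 - u / rho))


def empirical_margin_loss(all_scores, labels, rho: float) -> float:
    """Left-hand side of (8.6): average of Phi_rho over the sample margins."""
    m = len(labels)
    return sum(margin_loss(margin(s, y), rho) for s, y in zip(all_scores, labels)) / m


def margin_error_fraction(all_scores, labels, rho: float) -> float:
    """Right-hand side of (8.6): fraction of points with margin at most rho."""
    m = len(labels)
    return sum(1 for s, y in zip(all_scores, labels) if margin(s, y) <= rho) / m


# Three points, k = 3 classes; row i holds the scores h(x_i, 0), h(x_i, 1), h(x_i, 2).
scores = [[2.0, 0.5, 0.0],   # margin  1.5: confidently correct
          [0.2, 0.0, 0.1],   # margin  0.1: correct with small confidence
          [0.0, 1.0, 3.0]]   # margin -2.0: misclassified
labels = [0, 0, 1]

loss = empirical_margin_loss(scores, labels, rho=1.0)
fraction = margin_error_fraction(scores, labels, rho=1.0)
```

In this example the first point contributes nothing, the second contributes a partial loss, and the third contributes the full loss, so `loss` stays below `fraction`, in agreement with (8.6).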
The following lemma will be used in the proof of the main result of this section.


Lemma 8.1

Let $\mathcal{F}_1, \ldots, \mathcal{F}_l$ be $l$ hypothesis sets in $\mathbb{R}^{\mathcal{X}}$, $l \geq 1$, and let $\mathcal{G} = \{\max\{h_1, \ldots, h_l\}\colon h_i \in
\mathcal{F}_i, i \in [1, l]\}$. Then, for any sample $S$ of size $m$, the empirical Rademacher
complexity of $\mathcal{G}$ can be upper bounded as follows:

$$\widehat{\mathfrak{R}}_S(\mathcal{G}) \leq \sum_{j=1}^{l} \widehat{\mathfrak{R}}_S(\mathcal{F}_j). \qquad (8.7)$$
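Proof Let $S = (x_1, \ldots, x_m)$ be a sample of size $m$. We first prove the result in
the case $l = 2$. By definition of the max operator, for any $h_1 \in \mathcal{F}_1$ and $h_2 \in \mathcal{F}_2$,

$$\max\{h_1, h_2\} = \frac{1}{2}\big[h_1 + h_2 + |h_1 - h_2|\big].$$

Thus, we can write:

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(\mathcal{G}) &= \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{h_1 \in \mathcal{F}_1, h_2 \in \mathcal{F}_2} \sum_{i=1}^{m} \sigma_i \max\{h_1(x_i), h_2(x_i)\}\Big] \\
&= \frac{1}{2m}\mathbb{E}_\sigma\Big[\sup_{h_1 \in \mathcal{F}_1, h_2 \in \mathcal{F}_2} \sum_{i=1}^{m} \sigma_i \big(h_1(x_i) + h_2(x_i) + |(h_1 - h_2)(x_i)|\big)\Big] \\
&\leq \frac{1}{2}\widehat{\mathfrak{R}}_S(\mathcal{F}_1) + \frac{1}{2}\widehat{\mathfrak{R}}_S(\mathcal{F}_2) + \frac{1}{2m}\mathbb{E}_\sigma\Big[\sup_{h_1 \in \mathcal{F}_1, h_2 \in \mathcal{F}_2} \sum_{i=1}^{m} \sigma_i |(h_1 - h_2)(x_i)|\Big], \qquad (8.8)
\end{aligned}$$

using the sub-additivity of sup. Since $x \mapsto |x|$ is 1-Lipschitz, by Talagrand's lemma
(lemma 4.2), the last term can be bounded as follows

$$\begin{aligned}
\frac{1}{2m}\mathbb{E}_\sigma\Big[\sup_{h_1 \in \mathcal{F}_1, h_2 \in \mathcal{F}_2} \sum_{i=1}^{m} \sigma_i |(h_1 - h_2)(x_i)|\Big] &\leq \frac{1}{2m}\mathbb{E}_\sigma\Big[\sup_{h_1 \in \mathcal{F}_1, h_2 \in \mathcal{F}_2} \sum_{i=1}^{m} \sigma_i (h_1 - h_2)(x_i)\Big] \\
&\leq \frac{1}{2}\widehat{\mathfrak{R}}_S(\mathcal{F}_1) + \frac{1}{2m}\mathbb{E}_\sigma\Big[\sup_{h_2 \in \mathcal{F}_2} \sum_{i=1}^{m} -\sigma_i h_2(x_i)\Big] \\
&= \frac{1}{2}\widehat{\mathfrak{R}}_S(\mathcal{F}_1) + \frac{1}{2}\widehat{\mathfrak{R}}_S(\mathcal{F}_2), \qquad (8.9)
\end{aligned}$$

where we again use the sub-additivity of sup for the second inequality and the fact
that $\sigma_i$ and $-\sigma_i$ have the same distribution for any $i \in [1, m]$ for the last equality.
Combining (8.8) and (8.9) yields $\widehat{\mathfrak{R}}_S(\mathcal{G}) \leq \widehat{\mathfrak{R}}_S(\mathcal{F}_1) + \widehat{\mathfrak{R}}_S(\mathcal{F}_2)$. The general case can
be derived from the case $l = 2$ using $\max\{h_1, \ldots, h_l\} = \max\{h_1, \max\{h_2, \ldots, h_l\}\}$
and an immediate recurrence. ∎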
For any family of hypotheses $H$ mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$, we define $\Pi_1(H)$ by

$$\Pi_1(H) = \{x \mapsto h(x, y)\colon y \in \mathcal{Y}, h \in H\}.$$

The following theorem gives a general margin bound for multi-class classification.


Theorem 8.1 Margin bound for multi-class classification

Let $H \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set with $\mathcal{Y} = \{1, \ldots, k\}$. Fix $\rho > 0$. Then, for
any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification
generalization bound holds for all $h \in H$:


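$$R(h) \leq \widehat{R}_\rho(h) + \frac{2k^2}{\rho}\mathfrak{R}_m(\Pi_1(H)) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \qquad (8.10)$$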


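Proof The first part of the proof is similar to that of theorem 4.4. Let $\widetilde{H}$ be
the family of hypotheses mapping $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$ defined by $\widetilde{H} = \{z = (x, y) \mapsto
\rho_h(x, y)\colon h \in H\}$. Consider the family of functions $\widetilde{\mathcal{H}} = \{\Phi_\rho \circ r\colon r \in \widetilde{H}\}$ derived
from $\widetilde{H}$, which take values in $[0, 1]$. By theorem 3.1, with probability at least $1 - \delta$,
for all $h \in H$,

$$\mathbb{E}\big[\Phi_\rho(\rho_h(x, y))\big] \leq \widehat{R}_\rho(h) + 2\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Since $1_{u \leq 0} \leq \Phi_\rho(u)$ for all $u \in \mathbb{R}$, the generalization error $R(h)$ is a lower bound on
the left-hand side, $R(h) = \mathbb{E}[1_{\rho_h(x, y) \leq 0}] \leq \mathbb{E}[\Phi_\rho(\rho_h(x, y))]$, and we can write:

$$R(h) \leq \widehat{R}_\rho(h) + 2\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$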


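As in the proof of theorem 4.4, we can show that $\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \leq \frac{1}{\rho}\mathfrak{R}_m(\widetilde{H})$ using
the $(1/\rho)$-Lipschitzness of $\Phi_\rho$. Here, $\mathfrak{R}_m(\widetilde{H})$ can be upper bounded as follows:

$$\begin{aligned}
\mathfrak{R}_m(\widetilde{H}) &= \frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \rho_h(x_i, y_i)\Big] \\
&= \frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sum_{y \in \mathcal{Y}} \sigma_i \rho_h(x_i, y) 1_{y = y_i}\Big] \\
&\leq \frac{1}{m}\sum_{y \in \mathcal{Y}} \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \rho_h(x_i, y) 1_{y = y_i}\Big] && \text{(sub-additivity of sup)} \\
&= \frac{1}{m}\sum_{y \in \mathcal{Y}} \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \rho_h(x_i, y)\Big(\frac{2(1_{y = y_i}) - 1}{2} + \frac{1}{2}\Big)\Big] \\
&\leq \frac{1}{2m}\sum_{y \in \mathcal{Y}} \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \epsilon_i \rho_h(x_i, y)\Big] + \frac{1}{2m}\sum_{y \in \mathcal{Y}} \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \rho_h(x_i, y)\Big] && \text{(sub-additivity of sup)} \\
&= \frac{1}{m}\sum_{y \in \mathcal{Y}} \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \rho_h(x_i, y)\Big],
\end{aligned}$$

where by definition $\epsilon_i = 2(1_{y = y_i}) - 1 \in \{-1, +1\}$ and we use the fact that $\sigma_i$ and $\sigma_i \epsilon_i$ have the
same distribution.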

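Let $\Pi_1(H)^{(k-1)} = \{\max\{h_1, \ldots, h_{k-1}\}\colon h_i \in \Pi_1(H), i \in [1, k-1]\}$. Now, rewriting
$\rho_h(x_i, y)$ explicitly, using again the sub-additivity of sup, observing that $-\sigma_i$ and
$\sigma_i$ are distributed in the same way, and using lemma 8.1 leads to

$$\begin{aligned}
\mathfrak{R}_m(\widetilde{H}) &\leq \frac{1}{m}\sum_{y \in \mathcal{Y}} \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \Big(h(x_i, y) - \max_{y' \neq y} h(x_i, y')\Big)\Big] \\
&\leq \sum_{y \in \mathcal{Y}} \Big[\frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i, y)\Big] + \frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} -\sigma_i \max_{y' \neq y} h(x_i, y')\Big]\Big] \\
&\leq \sum_{y \in \mathcal{Y}} \Big[\frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{h \in \Pi_1(H)} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] + \frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{h \in \Pi_1(H)^{(k-1)}} \sum_{i=1}^{m} \sigma_i h(x_i)\Big]\Big] \\
&\leq k\Big[\mathfrak{R}_m(\Pi_1(H)) + (k-1)\mathfrak{R}_m(\Pi_1(H))\Big] = k^2\, \mathfrak{R}_m(\Pi_1(H)).
\end{aligned}$$

This concludes the proof. ∎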
These bounds can be generalized to hold uniformly for all $\rho > 0$ at the cost of
an additional term $\sqrt{(\log\log_2(2/\rho))/m}$, as in theorem 4.5 and exercise 4.2. As for
other margin bounds presented in previous sections, they show the conflict between
two terms: the larger the desired margin $\rho$, the smaller the middle
term, at the price of a larger empirical multi-class classification margin loss $\widehat{R}_\rho$. Note,
however, that here there is additionally a quadratic dependency on the number of
classes $k$. This suggests weaker guarantees when learning with a large number of
classes or the need for even larger margins $\rho$ for which the empirical margin loss
would be small.

For some hypothesis sets, a simple upper bound can be derived for the
Rademacher complexity of $\Pi_1(H)$, thereby making theorem 8.1 more explicit. We
will show this for kernel-based hypotheses. Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and
let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a feature mapping associated to $K$. In multi-class classification, a
kernel-based hypothesis is based on $k$ weight vectors $w_1, \ldots, w_k \in \mathbb{H}$. Each weight
vector $w_l$, $l \in [1, k]$, defines a scoring function $x \mapsto w_l \cdot \Phi(x)$ and the class associated
to point $x \in \mathcal{X}$ is given by

$$\operatorname*{argmax}_{y \in \mathcal{Y}} w_y \cdot \Phi(x).$$

We denote by $W$ the matrix formed by these weight vectors: $W = (w_1^\top, \ldots, w_k^\top)^\top$
and for any $p \geq 1$ denote by $\|W\|_{\mathbb{H},p}$ the $L_{\mathbb{H},p}$ group norm of $W$ defined by

$$\|W\|_{\mathbb{H},p} = \Big(\sum_{l=1}^{k} \|w_l\|_{\mathbb{H}}^{p}\Big)^{1/p}.$$
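As a small numerical illustration of the group norm, the sketch below assumes an explicit finite-dimensional feature space in which $\|\cdot\|_{\mathbb{H}}$ is the Euclidean norm. That assumption, as well as the name `group_norm`, is ours. The matrix $W$ is represented as a list of $k$ weight vectors.

```python
import math
from typing import Sequence


def group_norm(W: Sequence[Sequence[float]], p: float) -> float:
    """L_{H,p} group norm: the l_p norm of the vector of row norms of W.

    Assumes each row w_l is an explicit finite-dimensional vector and that
    the Hilbert-space norm is the Euclidean norm.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    row_norms = [math.sqrt(sum(c * c for c in w)) for w in W]
    return sum(r ** p for r in row_norms) ** (1.0 / p)


# k = 3 weight vectors in R^2 with Euclidean norms 5, 0 and 13.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]

norm_p1 = group_norm(W, 1)   # sum of the row norms
norm_p2 = group_norm(W, 2)   # Frobenius norm of W
```

For every $p \geq 1$ each individual row norm is at most the group norm, so a constraint on the group norm of $W$ also bounds the norm of each weight vector.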


For any $p \geq 1$, the family of kernel-based hypotheses we will consider is¹

$$H_{K,p} = \{(x, y) \in \mathcal{X} \times \{1, \ldots, k\} \mapsto w_y \cdot \Phi(x)\colon W = (w_1, \ldots, w_k)^\top, \|W\|_{\mathbb{H},p} \leq \Lambda\}.$$


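Proposition 8.1 Rademacher complexity of multi-class kernel-based hypotheses

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a feature mapping
associated to $K$. Assume that there exists $r > 0$ such that $K(x, x) \leq r^2$ for all
$x \in \mathcal{X}$. Then, for any $m \geq 1$, $\mathfrak{R}_m(\Pi_1(H_{K,p}))$ can be bounded as follows:

$$\mathfrak{R}_m(\Pi_1(H_{K,p})) \leq \sqrt{\frac{r^2 \Lambda^2}{m}}.$$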
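Proof Let $S = (x_1, \ldots, x_m)$ denote a sample of size $m$. Observe that for all
$l \in [1, k]$, the inequality $\|w_l\|_{\mathbb{H}} \leq \big(\sum_{l=1}^{k} \|w_l\|_{\mathbb{H}}^{p}\big)^{1/p} = \|W\|_{\mathbb{H},p}$ holds. Thus, the
condition $\|W\|_{\mathbb{H},p} \leq \Lambda$ implies that $\|w_l\|_{\mathbb{H}} \leq \Lambda$ for all $l \in [1, k]$. In view of that,
the Rademacher complexity of the hypothesis set $\Pi_1(H_{K,p})$ can be expressed and
bounded as follows:

$$\begin{aligned}
\mathfrak{R}_m(\Pi_1(H_{K,p})) &= \frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{y \in \mathcal{Y}, \|W\|_{\mathbb{H},p} \leq \Lambda} \Big\langle w_y, \sum_{i=1}^{m} \sigma_i \Phi(x_i)\Big\rangle\Big] \\
&\leq \frac{1}{m}\mathbb{E}_{S,\sigma}\Big[\sup_{y \in \mathcal{Y}, \|W\|_{\mathbb{H},p} \leq \Lambda} \|w_y\|_{\mathbb{H}} \Big\|\sum_{i=1}^{m} \sigma_i \Phi(x_i)\Big\|_{\mathbb{H}}\Big] && \text{(Cauchy-Schwarz inequality)} \\
&\leq \frac{\Lambda}{m}\mathbb{E}_{S,\sigma}\Big[\Big\|\sum_{i=1}^{m} \sigma_i \Phi(x_i)\Big\|_{\mathbb{H}}\Big] \\
&\leq \frac{\Lambda}{m}\Big[\mathbb{E}_{S,\sigma}\Big[\Big\|\sum_{i=1}^{m} \sigma_i \Phi(x_i)\Big\|_{\mathbb{H}}^{2}\Big]\Big]^{1/2} && \text{(Jensen's inequality)} \\
&= \frac{\Lambda}{m}\Big[\mathbb{E}_{S}\Big[\sum_{i=1}^{m} K(x_i, x_i)\Big]\Big]^{1/2} && (i \neq j \Rightarrow \mathbb{E}_\sigma[\sigma_i \sigma_j] = 0) \\
&\leq \frac{\Lambda\sqrt{m r^2}}{m} = \sqrt{\frac{r^2 \Lambda^2}{m}},
\end{aligned}$$

which concludes the proof.

1. The hypothesis set $H_{K,p}$ can also be defined via $H_{K,p} = \{h \in \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}\colon h(\cdot, y) \in \mathbb{H} \wedge \|h\|_{K,p} \leq
\Lambda\}$, where $\|h\|_{K,p} = \big(\sum_{y=1}^{k} \|h(\cdot, y)\|_{K}^{p}\big)^{1/p}$, without referring to a feature mapping for $K$.
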
Combining theorem 8.1 and proposition 8.1 yields directly the following result.


Corollary 8.1 Margin bound for multi-class classification with kernel-based hypotheses

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a feature mapping
associated to $K$. Assume that there exists $r > 0$ such that $K(x, x) \leq r^2$ for all
$x \in \mathcal{X}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following
multi-class classification generalization bound holds for all $h \in H_{K,p}$:

$$R(h) \leq \widehat{R}_\rho(h) + 2k^2\sqrt{\frac{r^2 \Lambda^2 / \rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \qquad (8.11)$$
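To get a feel for how the terms of this bound scale, the following Python sketch evaluates the right-hand side of (8.11), read here as the empirical margin loss plus $2k^2\sqrt{(r^2\Lambda^2/\rho^2)/m}$ plus $\sqrt{\log(1/\delta)/(2m)}$. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def complexity_term(k: int, r: float, lam: float, rho: float, m: int) -> float:
    """Middle term of (8.11): 2 k^2 sqrt((r^2 Lambda^2 / rho^2) / m)."""
    return 2 * k ** 2 * math.sqrt((r ** 2 * lam ** 2 / rho ** 2) / m)


def confidence_term(delta: float, m: int) -> float:
    """Last term of (8.11): sqrt(log(1/delta) / (2m))."""
    return math.sqrt(math.log(1.0 / delta) / (2 * m))


def margin_bound(emp_margin_loss: float, k: int, r: float, lam: float,
                 rho: float, m: int, delta: float) -> float:
    """Right-hand side of (8.11) for a given empirical margin loss."""
    return emp_margin_loss + complexity_term(k, r, lam, rho, m) + confidence_term(delta, m)


# Hypothetical setting: unit-norm features, Lambda = 1, margin 0.5, one million points.
settings = dict(r=1.0, lam=1.0, rho=0.5, m=1_000_000)
term_k10 = complexity_term(k=10, **settings)
term_k20 = complexity_term(k=20, **settings)
bound_k10 = margin_bound(0.05, k=10, delta=0.05, **settings)
```

Doubling the number of classes multiplies the complexity term by four, which is the quadratic dependency on $k$ discussed above. Halving the margin $\rho$ doubles the same term.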


In the next two sections, we describe multi-class classification algorithms that
belong to two distinct families: uncombined algorithms, which are defined by a single
optimization problem, and aggregated algorithms, which are obtained by training
multiple binary classifiers and by combining their outputs.


8.3 Uncombined multi-class algorithms


In this section, we describe three algorithms designed specifically for multi-class 
classification. We start with a multi-class version of SVMs, then describe a boosting- 
type multi-class algorithm, and conclude with decision trees, which are often used 
as base learners in boosting. 


8.3.1 Multi-class SVMs 


We describe an algorithm that can be derived directly from the theoretical guarantees
presented in the previous section. Proceeding as in section 4.4 for classification,
the guarantee of corollary 8.1 can be expressed as follows: for any $\delta > 0$,
with probability at least $1 - \delta$, for all $h \in H_{K,2} = \{(x, y) \mapsto w_y \cdot \Phi(x)\colon W =
(w_1, \ldots, w_k)^\top, \sum_{l=1}^{k} \|w_l\|^2 \leq \Lambda^2\}$,

$$R(h) \leq \frac{1}{m}\sum_{i=1}^{m} \xi_i + 4k^2\sqrt{\frac{r^2 \Lambda^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}, \qquad (8.12)$$

where $\xi_i = \max\big(1 - [w_{y_i} \cdot \Phi(x_i) - \max_{y' \neq y_i} w_{y'} \cdot \Phi(x_i)], 0\big)$ for all $i \in [1, m]$.
An algorithm based on this theoretical guarantee consists of minimizing the
right-hand side of (8.12), that is, minimizing an objective function with a term
corresponding to the sum of the slack variables $\xi_i$, and another one minimizing
$\|W\|_{\mathbb{H},2}$ or equivalently $\sum_{l=1}^{k} \|w_l\|^2$. This is precisely the optimization problem
defining the multi-class SVM algorithm:

$$\min_{W, \xi} \; \frac{1}{2}\sum_{l=1}^{k} \|w_l\|^2 + C\sum_{i=1}^{m} \xi_i$$

subject to: $\forall i \in [1, m]$, $\forall l \in \mathcal{Y} - \{y_i\}$,
$w_{y_i} \cdot \Phi(x_i) \geq w_l \cdot \Phi(x_i) + 1 - \xi_i$ and $\xi_i \geq 0$.

The decision function learned is of the form $x \mapsto \operatorname{argmax}_{l \in \mathcal{Y}} w_l \cdot \Phi(x)$. As with
the primal problem of SVMs, this is a convex optimization problem: the objective
function is convex, since it is a sum of convex functions, and the constraints are
affine and thus qualified. The objective and constraint functions are differentiable,
and the KKT conditions hold at the optimum. Defining the Lagrangian and applying
these conditions leads to the equivalent dual optimization problem, which can be
expressed in terms of the kernel function $K$ alone:

$$\max_{\alpha \in \mathbb{R}^{m \times k}} \; \sum_{i=1}^{m} \alpha_i \cdot e_{y_i} - \frac{1}{2}\sum_{i,j=1}^{m} (\alpha_i \cdot \alpha_j) K(x_i, x_j)$$

subject to: $\forall i \in [1, m]$, $(0 \leq \alpha_{i y_i} \leq C) \wedge (\forall j \neq y_i, \alpha_{ij} \leq 0) \wedge (\alpha_i \cdot \mathbf{1} = 0)$.


Here, $\alpha \in \mathbb{R}^{m \times k}$ is a matrix, $\alpha_i$ denotes the $i$th row of $\alpha$, and $e_l$ the $l$th unit vector
in $\mathbb{R}^k$, $l \in [1, k]$. Both the primal and dual problems are simple QPs generalizing
those of the standard SVM algorithm. However, the size of the solution and the
number of constraints for both problems are in $\Omega(mk)$, which, for a large number of
classes $k$, can make them difficult to solve. Nevertheless, there exist specific optimization
solutions designed for this problem based on a decomposition of the problem into
$m$ disjoint sets of constraints.


8.3.2 Multi-class boosting algorithms 


We describe a boosting algorithm for multi-class classification called AdaBoost.MH,
which in fact coincides with a special instance of AdaBoost. An alternative multi-class
classification algorithm based on similar boosting ideas, AdaBoost.MR, is
described and analyzed in exercise 9.5. AdaBoost.MH applies to the multi-label
setting where $\mathcal{Y} = \{-1, +1\}^k$. As in the binary case, it returns a convex combination
of base classifiers selected from a hypothesis set $H$. Let $F$ be the following objective
function defined for all samples $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ and
$\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, $n \geq 1$, by

$$F(\alpha) = \sum_{i=1}^{m}\sum_{l=1}^{k} e^{-y_i[l]\, g_n(x_i, l)} = \sum_{i=1}^{m}\sum_{l=1}^{k} e^{-y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}, \qquad (8.13)$$

where $g_n = \sum_{t=1}^{n} \alpha_t h_t$ and where $y_i[l]$ denotes the $l$th coordinate of $y_i$ for any
$i \in [1, m]$ and $l \in [1, k]$. $F$ is a convex and differentiable upper bound on the
multi-class multi-label loss:

$$\sum_{i=1}^{m}\sum_{l=1}^{k} 1_{y_i[l] \neq \operatorname{sgn}(g_n(x_i, l))} \leq \sum_{i=1}^{m}\sum_{l=1}^{k} e^{-y_i[l]\, g_n(x_i, l)}, \qquad (8.14)$$

since for any $x \in \mathcal{X}$ with label $y = f(x)$ and any $l \in [1, k]$, the inequality
$1_{y[l] \neq \operatorname{sgn}(g_n(x, l))} \leq e^{-y[l]\, g_n(x, l)}$ holds. AdaBoost.MH coincides exactly with the
application of coordinate descent to the objective function $F$. Figure 8.1 gives the
pseudocode of the algorithm in the case where the base classifiers are functions
mapping from $\mathcal{X} \times \mathcal{Y}$ to $\{-1, +1\}$. The algorithm takes as input a labeled sample
$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ and maintains a distribution $D_t$ over
$\{1, \ldots, m\} \times \mathcal{Y}$. The remaining details of the algorithm are similar to AdaBoost. In
fact, AdaBoost.MH exactly coincides with AdaBoost applied to the training sample
derived from $S$ by splitting each labeled point $(x_i, y_i)$ into $k$ labeled examples
$((x_i, l), y_i[l])$, with each example $(x_i, l)$ in $\mathcal{X} \times \mathcal{Y}$ and its label in $\{-1, +1\}$:

$$(x_i, y_i) \to ((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k]), \quad i \in [1, m].$$

Let $S'$ denote the resulting sample, then $S' = (((x_1, 1), y_1[1]), \ldots, ((x_m, k), y_m[k]))$.
$S'$ contains $mk$ examples and the expression of the objective function $F$ in (8.13)
coincides exactly with that of the objective function of AdaBoost for the sample $S'$.
In view of this connection, the theoretical analysis along with the other observations
we presented for AdaBoost in chapter 6 also apply here. Hence, we will focus on
aspects related to the computational efficiency and to the weak learning condition
that are specific to the multi-class scenario.


AdaBoost.MH(S = ((x_1, y_1), ..., (x_m, y_m)))
    for i ← 1 to m do
        for l ← 1 to k do
            D_1(i, l) ← 1/(mk)
    for t ← 1 to T do
        h_t ← base classifier in H with small error ε_t = Pr_{(i,l)∼D_t}[h_t(x_i, l) ≠ y_i[l]]
        α_t ← (1/2) log((1 − ε_t)/ε_t)
        Z_t ← 2[ε_t(1 − ε_t)]^{1/2}        ▷ normalization factor
        for i ← 1 to m do
            for l ← 1 to k do
                D_{t+1}(i, l) ← D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l)) / Z_t
    g ← Σ_{t=1}^{T} α_t h_t
    return h = sgn(g)


Figure 8.1 AdaBoost.MH algorithm, for $H \subseteq (\{-1, +1\}^k)^{\mathcal{X} \times \mathcal{Y}}$.

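The pseudocode of figure 8.1 can be turned into a small runnable sketch. The version below is only an illustration under several assumptions of ours: labels are indexed from 0, the base classifiers are threshold stumps of the form $h(x, l) = s_l \cdot \operatorname{sgn}(x_j - \theta)$ with one sign $s_l$ per label, the base learner is an exhaustive search, and the distribution is renormalized by its explicit sum (which coincides with $Z_t$ when $\epsilon_t$ is not clamped). The names `make_stump`, `best_stump` and `adaboost_mh` are hypothetical.

```python
import math


def make_stump(j, theta, signs):
    """Base classifier h(x, l) = signs[l] * (+1 if x[j] > theta else -1)."""
    return lambda x, l: signs[l] * (1 if x[j] > theta else -1)


def best_stump(X, Y, D):
    """Exhaustive search for the stump with the smallest weighted error under D."""
    m, k = len(X), len(Y[0])
    best_error, best_h = None, None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for theta in thresholds:
            signs, error = [], 0.0
            for l in range(k):
                total = sum(D[i][l] for i in range(m))
                # Weighted error on label l if the stump votes with sign +1.
                err_plus = sum(D[i][l] for i in range(m)
                               if (1 if X[i][j] > theta else -1) != Y[i][l])
                if err_plus <= total - err_plus:
                    signs.append(1)
                    error += err_plus
                else:
                    signs.append(-1)
                    error += total - err_plus
            if best_error is None or error < best_error:
                best_error, best_h = error, make_stump(j, theta, signs)
    return best_error, best_h


def adaboost_mh(X, Y, T):
    m, k = len(X), len(Y[0])
    D = [[1.0 / (m * k)] * k for _ in range(m)]    # D_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(T):
        eps, h = best_stump(X, Y, D)
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        for i in range(m):
            for l in range(k):
                D[i][l] *= math.exp(-alpha * Y[i][l] * h(X[i], l))
        # Explicit renormalization; for an unclamped eps this sum equals
        # Z_t = 2 * sqrt(eps * (1 - eps)).
        z = sum(sum(row) for row in D)
        D = [[w / z for w in row] for row in D]
        ensemble.append((alpha, h))

    def predict(x):
        # h = sgn(g) with g = sum_t alpha_t h_t, evaluated for each label l.
        return tuple(1 if sum(a * h(x, l) for a, h in ensemble) >= 0 else -1
                     for l in range(k))
    return predict


# Toy multi-label sample: one real feature, k = 2 labels.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [(-1, 1), (-1, -1), (1, -1), (1, -1)]
h = adaboost_mh(X, Y, T=3)
training_predictions = [h(x) for x in X]
```

Each round scans the $mk$ pairs $(i, l)$, which mirrors the reduction to a binary sample of size $mk$ described above. On this toy sample the two labels need different thresholds, so no single stump fits both, and three rounds suffice to fit the training sample.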

The complexity of the algorithm is that of AdaBoost applied to a sample of
size $mk$. For $\mathcal{X} \subseteq \mathbb{R}^N$, using boosting stumps as base classifiers, the complexity of
the algorithm is therefore in $O((mk)\log(mk) + mkNT)$. Thus, for a large number
of classes $k$, the algorithm may become impractical using a single processor. The
weak learning condition for the application of AdaBoost in this scenario requires
that at each round there exists a base classifier $h_t\colon \mathcal{X} \times \mathcal{Y} \to \{-1, +1\}$ such that
$\Pr_{(i,l) \sim D_t}[h_t(x_i, l) \neq y_i[l]] < 1/2$. This may be hard to achieve if classes are close
and it is difficult to distinguish between them. It is also more difficult in this context
to come up with "rules of thumb" $h_t$ defined over $\mathcal{X} \times \mathcal{Y}$.


Figure 8.2 Left: example of a decision tree with numerical questions based on two
variables $X_1$ and $X_2$. Here, each leaf is marked with the region it defines. The class
labeling for a leaf is obtained via majority vote based on the training points falling
in the region it defines. Right: Partition of the two-dimensional space induced by
that decision tree.


8.3.3 Decision trees 


We present and discuss the general learning method of decision trees that can 
be used in multi-class classification, but also in other learning problems such as 
regression (chapter 10) and clustering. Although the empirical performance of 
decision trees often is not state-of-the-art, decision trees can be used as weak learners 
with boosting to define effective learning algorithms. Decision trees are also typically 
fast to train and evaluate and relatively easy to interpret. 


Definition 8.1 Binary decision tree

A binary decision tree is a tree representation of a partition of the feature space.
Figure 8.2 shows a simple example in the case of a two-dimensional space based
on two features $X_1$ and $X_2$, as well as the partition it represents. Each interior
node of a decision tree corresponds to a question related to features. It can be a
numerical question of the form $X_i \leq a$ for a feature variable $X_i$, $i \in [1, N]$, and
some threshold $a \in \mathbb{R}$, as in the example of figure 8.2, or a categorical question
such as $X_i \in \{\text{blue}, \text{white}, \text{red}\}$, when feature $X_i$ takes a categorical value such as a
color. Each leaf is labeled with a label $l \in \mathcal{Y}$.


Decision trees can be defined using more complex node questions, resulting in
partitions based on more complex decision surfaces. For example, binary space
partition (BSP) trees partition the space with convex polyhedral regions, based
on questions of the form $\sum_{i=1}^{N} \alpha_i X_i \leq a$, and sphere trees partition with pieces
of spheres based on questions of the form $\|X - a_0\| \leq a$, where $X$ is a feature
vector, $a_0$ a fixed vector, and $a$ is a fixed positive real number. More complex
tree questions lead to richer partitions and thus hypothesis sets, which can cause
overfitting in the absence of a sufficiently large training sample. They also increase
the computational complexity of prediction and training. Decision trees can also
be generalized to branching factors greater than two, but binary trees are most
commonly used due to computational considerations.


GreedyDecisionTrees(S = ((x_1, y_1), ..., (x_m, y_m)))
    tree ← {n_0}        ▷ root node.
    for t ← 1 to T do
        (n_t, q_t) ← argmax_{(n,q)} F̃(n, q)        ▷ largest decrease in impurity
        Split(tree, n_t, q_t)
    return tree


Figure 8.3 Greedy algorithm for building a decision tree from a labeled sample $S$.
The procedure Split(tree, $n_t$, $q_t$) splits node $n_t$ by making it an internal node with
question $q_t$ and leaf children $n_-(n, q)$ and $n_+(n, q)$, each labeled with the dominating
class of the region it defines, with ties broken arbitrarily.

Prediction/partitioning: To predict the label of any point $x \in \mathcal{X}$ we start
at the root node of the decision tree and go down the tree until a leaf is found,
by moving to the right child of a node when the response to the node question is
positive, and to the left child otherwise. When we reach a leaf, we associate $x$ with
the label of this leaf.

Thus, each leaf defines a region of $\mathcal{X}$ formed by the set of points corresponding
exactly to the same node responses and thus the same traversal of the tree. By
definition, no two regions intersect and all points belong to exactly one region.
Thus, leaf regions define a partition of $\mathcal{X}$, as shown in the example of figure 8.2. In
multi-class classification, the label of a leaf is determined using the training sample:
the class with the majority representation among the training points falling in a
leaf region defines the label of that leaf, with ties broken arbitrarily.
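The prediction procedure just described amounts to a simple traversal. The following Python sketch illustrates it with a hypothetical `Node` structure of our own: an internal node stores a question (a Boolean function of the point) and two children, and a leaf stores only a label. The example tree and its thresholds are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Node:
    question: Optional[Callable[[Sequence[float]], bool]] = None  # None for a leaf
    left: Optional["Node"] = None     # followed when the answer is negative
    right: Optional["Node"] = None    # followed when the answer is positive
    label: Optional[int] = None       # class label, set only at leaves


def predict(root: Node, x: Sequence[float]) -> int:
    """Go down from the root until a leaf is found and return its label."""
    node = root
    while node.question is not None:
        node = node.right if node.question(x) else node.left
    return node.label


# A small tree with numerical questions on two features X1 and X2:
#   root asks "X1 <= 2?"; if yes, a second node asks "X2 <= 1?".
tree = Node(
    question=lambda x: x[0] <= 2.0,
    right=Node(
        question=lambda x: x[1] <= 1.0,
        right=Node(label=1),
        left=Node(label=2),
    ),
    left=Node(label=3),
)

predictions = [predict(tree, p) for p in [(1.0, 0.5), (1.0, 3.0), (5.0, 0.0)]]
```

Each leaf of this tree corresponds to one region of the plane, and every point reaches exactly one leaf, so the three leaves define a partition of the input space.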

Learning: We will discuss two different methods for learning a decision tree
using a labeled sample. The first method is a greedy technique. This is motivated
by the fact that the general problem of finding a decision tree with the smallest
error is NP-hard. The method consists of starting with a tree reduced to a single
(root) node, which is a leaf whose label is the class that has majority over the
entire sample. Next, at each round, a node $n_t$ is split based on some question
$q_t$. The pair $(n_t, q_t)$ is chosen so that the node impurity is maximally decreased
according to some measure of impurity $F$. We denote by $F(n)$ the impurity of $n$.
The decrease in node impurity after a split of node $n$ based on question $q$ is defined
as follows. Let $n_+(n, q)$ denote the right child of $n$ after the split, $n_-(n, q)$ the
left child, and $\eta(n, q)$ the fraction of the points in the region defined by $n$ that are
moved to $n_-(n, q)$. The total impurity of the leaves $n_-(n, q)$ and $n_+(n, q)$ is therefore
$\eta(n, q)F(n_-(n, q)) + (1 - \eta(n, q))F(n_+(n, q))$. Thus, the decrease in impurity $\widetilde{F}(n, q)$
by that split is given by

$$\widetilde{F}(n, q) = F(n) - \big[\eta(n, q)F(n_-(n, q)) + (1 - \eta(n, q))F(n_+(n, q))\big].$$

Figure 8.3 shows the pseudocode of this greedy construction based on $\widetilde{F}$. In practice,
the algorithm is stopped once all nodes have reached a sufficient level of purity, when
the number of points per leaf has become too small for further splitting, or based
on some other similar heuristic.

For any node $n$ and class $l \in [1, k]$, let $p_l(n)$ denote the fraction of points at $n$ that
belong to class $l$. Then, the three most commonly used measures of node impurity
$F$ are defined as follows:

$$F(n) = \begin{cases}
1 - \max_{l \in [1,k]} p_l(n) & \text{misclassification;} \\
-\sum_{l=1}^{k} p_l(n) \log_2 p_l(n) & \text{entropy;} \\
\sum_{l=1}^{k} p_l(n)(1 - p_l(n)) & \text{Gini index.}
\end{cases}$$

Figure 8.4 illustrates these definitions in the special case of two classes ($k = 2$). The
entropy and Gini index impurity functions are upper bounds on the misclassification
impurity function. All three functions are concave in the class fractions, which ensures that

$$F(n) - \big[\eta(n, q)F(n_-(n, q)) + (1 - \eta(n, q))F(n_+(n, q))\big] \geq 0.$$

However, the misclassification function is piecewise linear, so the decrease in impurity
$\widetilde{F}(n, q)$ is zero if the fraction of positive points remains less than (or more than) half
after a split. In some cases, the impurity cannot be decreased by any split using that
criterion. In contrast, the entropy and Gini functions are strictly concave, which
guarantees a strict decrease in impurity whenever the two children have different class
fractions. Furthermore, they are differentiable, which is a useful
feature for numerical optimization. Thus, the Gini index and the entropy criteria
are typically preferred in practice.
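The difference between the three criteria can be seen on a small example. The Python sketch below computes the three impurity measures from the labels of the points at a node and the decrease in impurity of a split, with the fraction $\eta$ taken as the share of points sent to the left child. The function names and the toy split are our own illustrative choices.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def class_fractions(labels: Sequence[int]) -> list:
    """Fractions p_l(n) of the points at a node that belong to each class."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]


def misclassification(labels: Sequence[int]) -> float:
    return 1.0 - max(class_fractions(labels))


def entropy(labels: Sequence[int]) -> float:
    return -sum(p * math.log2(p) for p in class_fractions(labels) if p > 0)


def gini(labels: Sequence[int]) -> float:
    return sum(p * (1.0 - p) for p in class_fractions(labels))


def impurity_decrease(parent, left, right, impurity: Callable) -> float:
    """F(n) - [eta F(n_-) + (1 - eta) F(n_+)], with eta the fraction sent left."""
    eta = len(left) / len(parent)
    return impurity(parent) - (eta * impurity(left) + (1.0 - eta) * impurity(right))


# Binary example: 8 positive and 2 negative points at the parent node.
parent = [1] * 8 + [0] * 2
left = [1] * 4               # pure child
right = [1] * 4 + [0] * 2    # still mostly positive

decrease_misclassification = impurity_decrease(parent, left, right, misclassification)
decrease_entropy = impurity_decrease(parent, left, right, entropy)
decrease_gini = impurity_decrease(parent, left, right, gini)
```

In this split the positive class remains the majority in both children, so the misclassification criterion registers no improvement, while the entropy and the Gini index both decrease strictly. This is the situation described in the paragraph above.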

The greedy method just described faces some issues. One issue relates to the
greedy nature of the algorithm: a split that looks bad on its own may dominate subsequent
useful splits, and together these could lead to trees with less impurity overall. This can be
addressed to a certain extent by using a look-ahead of some depth $d$ to determine
the splitting decisions, but such look-aheads can be computationally very costly.
Another issue relates to the size of the resulting tree. To achieve some desired level
of impurity, trees of relatively large sizes may be needed. But larger trees define
overly complex hypotheses with high VC-dimensions (see exercise 9.6) and thus
could overfit.


Figure 8.4 Node impurity plotted as a function of the fraction of positive examples
in the binary case: misclassification (in black), entropy (in green, scaled by 0.5 to set
the maximum to the same value for all three functions), and the Gini index (in red).

An alternative method for learning decision trees using a labeled training sample
is based on the so-called grow-then-prune strategy. First a very large tree is grown
until it fully fits the training sample or until no more than a very small number of
points are left at each leaf. Then, the resulting tree, denoted as tree, is pruned back
to minimize an objective function defined based on generalization bounds as the
sum of an empirical error and a complexity term that can be expressed in terms of
the size of $\widetilde{\text{tree}}$, the set of leaves of tree:

$$G_\lambda(\text{tree}) = \sum_{n \in \widetilde{\text{tree}}} |n| F(n) + \lambda |\widetilde{\text{tree}}|. \qquad (8.15)$$


$\lambda \geq 0$ is a regularization parameter determining the trade-off between misclassification,
or more generally impurity, versus tree complexity. For any tree tree′, we
denote by $\widehat{R}(\text{tree}')$ the total empirical error $\sum_{n \in \widetilde{\text{tree}'}} |n| F(n)$. We seek a sub-tree
$\text{tree}_\lambda$ of tree that minimizes $G_\lambda$ and that has the smallest size. $\text{tree}_\lambda$ can be shown
to be unique. To determine $\text{tree}_\lambda$, the following pruning method is used, which defines
a finite sequence of nested sub-trees $\text{tree}^{(0)}, \ldots, \text{tree}^{(n)}$. We start with the full
tree $\text{tree}^{(0)} = \text{tree}$ and for any $i \in [0, n-1]$, define $\text{tree}^{(i+1)}$ from $\text{tree}^{(i)}$ by collapsing
an internal node $n'$ of $\text{tree}^{(i)}$, that is by replacing the sub-tree rooted at $n'$ with a
leaf, or equivalently by combining the regions of all the leaves dominated by $n'$. $n'$
is chosen so that collapsing it causes the smallest per node increase in $\widehat{R}(\text{tree}^{(i)})$,
that is the smallest $r(\text{tree}^{(i)}, n')$ defined by

$$r(\text{tree}^{(i)}, n') = \frac{|n'| F(n') - \widehat{R}(\text{tree}')}{|\widetilde{\text{tree}'}| - 1},$$

where $n'$ is an internal node of $\text{tree}^{(i)}$ and tree′ denotes the sub-tree of $\text{tree}^{(i)}$ rooted
at $n'$. If several nodes $n'$ in $\text{tree}^{(i)}$ cause the same
smallest increase per node $r(\text{tree}^{(i)}, n')$, then all of them are pruned to define $\text{tree}^{(i+1)}$
from $\text{tree}^{(i)}$. This procedure continues until the tree $\text{tree}^{(n)}$ obtained has a single
node. The sub-tree $\text{tree}_\lambda$ can be shown to be among the elements of the sequence
$\text{tree}^{(0)}, \ldots, \text{tree}^{(n)}$. The parameter $\lambda$ is determined via $n$-fold cross-validation.

Decision trees seem relatively easy to interpret, and this is often underlined as
one of their most useful features. However, such interpretations should be carried
out with care since decision trees are unstable: small changes in the training data
may lead to very different splits and thus entirely different trees, as a result of their
hierarchical nature. Decision trees can also be used in a natural manner to deal
with the problem of missing features, which often appears in learning applications;
in practice, some feature values may be missing because the proper measurements
were not taken or because of some noise source causing their systematic absence. In
such cases, only those variables available at a node can be used in prediction. Finally,
decision trees can be used and learned from data in a similar way in regression (see
chapter 10).²


8.4 Aggregated multi-class algorithms 


In this section, we discuss a different approach to multi-class classification that 
reduces the problem to that of multiple binary classification tasks. A binary clas- 
sification algorithm is then trained for each of these tasks independently, and the 
multi-class predictor is defined as a combination of the hypotheses returned by each 
of these algorithms. We first discuss two standard techniques for the reduction of 
multi-class classification to binary classification, and then show that they are both 
special instances of a more general framework. 


8.4.1 One-versus-all 


Let S = ((1,Y1),---;%m,Ym)) € (¥ x Y)™ be a labeled training sample. A 
straightforward reduction of the multi-class classification to binary classification 


2. The only changes to the description for classification are the following. For prediction, 
the label of a leaf is defined as the mean squared average of the labels of the points falling 
in that region. For learning, the impurity function is the mean squared error. 
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is based on the so-called one-versus-all (OVA) or one-versus-the-rest technique. 
This technique consists of learning k binary classifiers hj: ¥ > {—1,+1}, 1 € Y, 
each seeking to discriminate one class | € Y from all the others. For any | € Y, hy 
is obtained by training a binary classification algorithm on the full sample S after 
relabeling points in class / with 1 and all others with —1. For | € Y, assume that 
h; is derived from the sign of a scoring function f;: ¥ — R, that is h; = sgn(f;), as 
in the case of many of the binary classification algorithms discussed in the previous 
chapters. Then, the multi-class hypothesis h: 4 — Y defined by the OVA technique 
is given by: 


Vee X, h(x) = argmax fi(2). (8.16) 
leY 

This formula may seem similar to those defining a multi-class classification
hypothesis in the case of uncombined algorithms. Note, however, that for uncombined
algorithms the functions $f_l$ are learned together, while here they are learned
independently. Formula (8.16) is well-founded when the scores given by the functions $f_l$
can be interpreted as confidence scores, that is when $f_l(x)$ is learned as an
estimate of the probability of $x$ conditioned on class $l$. However, in general, the scores
given by the functions $f_l$, $l \in \mathcal{Y}$, are not comparable and the OVA technique based
on (8.16) admits no principled justification. This is sometimes referred to as a
calibration problem. Clearly, this problem cannot be corrected by simply normalizing
the scores of each function to make their magnitudes uniform, or by applying other
similar heuristics. When it is justifiable, the OVA technique is simple and its
computational cost is $k$ times that of training a binary classification algorithm, which
is similar to the computational cost of many uncombined algorithms.
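The OVA reduction can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the binary learner `train_scorer` is a hypothetical stand-in (a plain perceptron, chosen for brevity, whose scores are not calibrated), classes are numbered from 0 to $k-1$, and the toy sample is invented. Prediction follows formula (8.16).

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def train_scorer(xs: List[Vector], ys: List[int], epochs: int = 50) -> Scorer:
    """Hypothetical binary learner: a perceptron returning f(x) = w.x + b."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Update on every point that is not strictly on the correct side.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def train_ova(xs: List[Vector], labels: List[int], k: int) -> List[Scorer]:
    """Train k scorers; scorer l sees class l as +1 and every other class as -1."""
    return [train_scorer(xs, [1 if y == l else -1 for y in labels]) for l in range(k)]


def predict_ova(scorers: List[Scorer], x: Vector) -> int:
    """Formula (8.16): return the class whose scorer gives the largest score."""
    return max(range(len(scorers)), key=lambda l: scorers[l](x))


# Invented toy sample: three well-separated clusters in the plane, one per class.
xs = [(0.0, 5.0), (0.5, 6.0), (5.0, 0.0), (6.0, 0.5), (-5.0, -5.0), (-6.0, -4.0)]
labels = [0, 0, 1, 1, 2, 2]
scorers = train_ova(xs, labels, k=3)
predictions = [predict_ova(scorers, x) for x in xs]
```

On this separable toy sample each scorer is positive only on its own class, so the argmax is unambiguous. In general the $k$ scores come from independent training runs and need not be comparable, which is the calibration problem discussed above.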


8.4.2 One-versus-one 


An alternative technique, known as the one-versus-one (OVO) technique, consists
of using the training data to learn (independently), for each pair of distinct classes
$(l, l') \in \mathcal{Y}^2$, $l \neq l'$, a binary classifier $h_{ll'}\colon \mathcal{X} \to \{-1, +1\}$ discriminating between
classes $l$ and $l'$. For any $(l, l') \in \mathcal{Y}^2$, $h_{ll'}$ is obtained by training a binary classification
algorithm on the sub-sample containing exactly the points labeled with $l$ or $l'$,
with the value $+1$ returned for class $l'$ and $-1$ for class $l$. This requires training
$\binom{k}{2} = k(k-1)/2$ classifiers, which are combined to define a multi-class classification
hypothesis $h$ via majority vote:

$$\forall x \in \mathcal{X}, \quad h(x) = \operatorname*{argmax}_{l' \in \mathcal{Y}} \big|\{l \colon h_{ll'}(x) = 1\}\big|. \tag{8.17}$$


Thus, for a fixed point $x \in \mathcal{X}$, if we describe the prediction values $h_{ll'}(x)$ as the
results of the matches in a tournament between two players $l$ and $l'$, with $h_{ll'}(x) = 1$
indicating $l'$ winning over $l$, then the class predicted by $h$ can be interpreted as the
one with the largest number of wins in that tournament.

         Training                       Testing
OVA      $O(k m^{\alpha})$              $O(k c_t)$
OVO      $O(k^{2-\alpha} m^{\alpha})$   $O(k^2 c_t)$

Table 8.1 Comparison of the time complexity of the OVA and OVO techniques for
both training and testing. The table assumes a full training sample of size $m$ with
each class represented by $m/k$ points. The time for training a binary classification
algorithm on a sample of size $n$ is assumed to be in $O(n^{\alpha})$. Thus, the training time
for the OVO technique is in $O(k^2 (m/k)^{\alpha}) = O(k^{2-\alpha} m^{\alpha})$. $c_t$ denotes the cost of testing
a single classifier.

Let $x \in \mathcal{X}$ be a point belonging to class $l'$. By definition of the OVO technique,
if $h_{ll'}(x) = 1$ for all $l \neq l'$, then the class associated to $x$ by OVO is the correct
class $l'$ since $|\{l \colon h_{ll'}(x) = 1\}| = k - 1$ and no other class can reach $(k - 1)$ wins. By
contraposition, if the OVO hypothesis misclassifies $x$, then at least one of the $(k - 1)$
binary classifiers $h_{ll'}$, $l \neq l'$, incorrectly classifies $x$. Assume that the generalization
error of all binary classifiers $h_{ll'}$ used by OVO is at most $r$; then, in view of this
discussion, the generalization error of the hypothesis returned by OVO is at most
$(k - 1)r$.
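The tournament interpretation of (8.17) can be sketched as follows. This is an illustration only: the pairwise classifiers are hypothetical nearest-centre rules on the real line rather than trained models, classes are numbered from 0, and the centres are invented.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

PairClassifier = Callable[[float], int]


def make_pair_classifier(c_l: float, c_lp: float) -> PairClassifier:
    """Hypothetical h_{ll'}: +1 if x is closer to the centre of l', else -1."""
    return lambda x: 1 if abs(x - c_lp) < abs(x - c_l) else -1


def count_wins(classifiers: Dict[Tuple[int, int], PairClassifier], k: int, x: float) -> List[int]:
    """Play all k(k-1)/2 matches and return the number of wins of each class."""
    wins = [0] * k
    for (l, lp), h in classifiers.items():
        if h(x) == 1:
            wins[lp] += 1  # l' wins over l
        else:
            wins[l] += 1   # l wins over l'
    return wins


def predict_ovo(classifiers: Dict[Tuple[int, int], PairClassifier], k: int, x: float) -> int:
    """Formula (8.17): the predicted class is the one with the most wins."""
    wins = count_wins(classifiers, k, x)
    return max(range(k), key=lambda c: wins[c])


centres = [0.0, 10.0, 20.0, 30.0]  # one assumed centre per class
k = len(centres)
classifiers = {
    (l, lp): make_pair_classifier(centres[l], centres[lp])
    for l, lp in combinations(range(k), 2)
}
```

For a point such as `x = 11.0`, class 1 wins all of its $k - 1 = 3$ matches, so no other class can tie with it, which is the observation behind the $(k - 1)r$ bound.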

The OVO technique is not subject to the calibration problem pointed out in the
case of the OVA technique. However, when the size of the sub-sample containing
members of the classes $l$ and $l'$ is relatively small, $h_{ll'}$ may be learned without
sufficient data or with increased risk of overfitting. Another concern often raised for
the use of this technique is the computational cost of training $k(k - 1)/2$ binary
classifiers versus that of the OVA technique.

Taking a closer look at the computational requirements of these two methods
reveals, however, that the disparity may not be so great and that in fact under
some assumptions the time complexity of training for OVO could be less than that
of OVA. Table 8.1 compares the computational complexity of these methods both
for training and testing assuming that the complexity of training a binary classifier
on a sample of size $m$ is in $O(m^{\alpha})$ and that each class is equally represented in
the training set, that is by $m/k$ points. Under these assumptions, if $\alpha \in [2, 3)$ as in
the case of some algorithms solving a QP problem, such as SVMs, then the time
complexity of training for the OVO technique is in fact more favorable than that
of OVA. For $\alpha = 1$, the two are comparable and it is only for sub-linear algorithms
that the OVA technique would benefit from a better complexity. In all cases, at test
time, OVO requires $k(k - 1)/2$ classifier evaluations, which is $(k - 1)/2$ times as many as
OVA. However, for some algorithms the evaluation time for each classifier could be
much smaller for OVO. For example, in the case of SVMs, the average number of
support vectors may be significantly smaller for OVO, since each classifier is trained
on a significantly smaller sample. If the number of support vectors is $k$ times smaller
and if sparse feature representations are used, then the time complexities of both
techniques for testing are comparable.
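The training-cost comparison of table 8.1 can be checked numerically. The sketch below assumes a cost of exactly $n^{\alpha}$ for training on $n$ points (the table only states an order of magnitude) and balanced classes; the function names and the values of $k$ and $m$ are illustrative.

```python
def ova_training_cost(k: int, m: int, alpha: float) -> float:
    """k binary problems, each trained on the full sample of size m."""
    return k * m ** alpha


def ovo_training_cost(k: int, m: int, alpha: float) -> float:
    """k(k-1)/2 binary problems, each trained on the 2m/k points of two classes."""
    return k * (k - 1) / 2 * (2 * m / k) ** alpha


k, m = 10, 1000
# Ratio OVO / OVA for a sub-linear, a linear and a quadratic learner.
ratios = {
    alpha: ovo_training_cost(k, m, alpha) / ova_training_cost(k, m, alpha)
    for alpha in (0.5, 1.0, 2.0)
}
```

With these values the ratio is below 1 for $\alpha = 2$, close to 1 for $\alpha = 1$, and above 1 for $\alpha = 0.5$, in line with the discussion above.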


8.4.3 Error-correction codes 


A more general method for the reduction of multi-class to binary classification is
based on the idea of error-correction codes (ECOC). This technique consists of
assigning to each class $l \in \mathcal{Y}$ a code word of length $c \geq 1$, which in the simplest case
is a binary vector $M_l \in \{-1, +1\}^c$. $M_l$ serves as a signature for class $l$, and together
these vectors define a matrix $M \in \{-1, +1\}^{k \times c}$ whose $l$th row is $M_l$, as illustrated
by figure 8.5. Next, for each column $j \in [1, c]$, a binary classifier $h_j\colon \mathcal{X} \to \{-1, +1\}$
is learned using the full training sample $S$, after relabeling with $+1$ the points that
belong to a class whose entry in column $j$ is $+1$, and all others with $-1$. For any $x \in \mathcal{X}$, let
$\mathbf{h}(x)$ denote the vector $\mathbf{h}(x) = (h_1(x), \ldots, h_c(x))^\top$. Then, the multi-class hypothesis
$h\colon \mathcal{X} \to \mathcal{Y}$ is defined by

$$\forall x \in \mathcal{X}, \quad h(x) = \operatorname*{argmin}_{l \in \mathcal{Y}} d_H\big(M_l, \mathbf{h}(x)\big). \tag{8.18}$$

Thus, the class predicted is the one whose signature is the closest to $\mathbf{h}(x)$ in
Hamming distance. Figure 8.5 illustrates this definition: no row of matrix $M$
matches the vector of predictions $\mathbf{h}(x)$ in that case, but the third row shares the
largest number of components with $\mathbf{h}(x)$.

The success of the ECOC technique depends on the minimal Hamming distance
between the class code words. Let $d$ denote that distance; then up to $r_0 = \lfloor \frac{d - 1}{2} \rfloor$
binary classification errors can be corrected by this technique: by definition of $d$,
even if $r \leq r_0$ binary classifiers $h_j$ misclassify $x \in \mathcal{X}$, $\mathbf{h}(x)$ is closest to the code
word of the correct class of $x$. For a fixed $c$, the design of the error-correction matrix
$M$ is subject to a trade-off, since larger $d$ values may imply substantially more
difficult binary classification tasks. In practice, each column may correspond to a
class feature determined based on domain knowledge.
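A minimal sketch of Hamming decoding (8.18) and of the error-correction capacity $\lfloor (d - 1)/2 \rfloor$ follows. The code matrix is invented for illustration (it is not the matrix of figure 8.5), classes are numbered from 0, and the function names are ours.

```python
from itertools import combinations
from typing import List

Code = List[int]


def hamming(u: Code, v: Code) -> int:
    """Number of positions at which two code words differ."""
    return sum(a != b for a, b in zip(u, v))


def decode(M: List[Code], h_x: Code) -> int:
    """Formula (8.18): index of the row of M closest to h(x) in Hamming distance."""
    return min(range(len(M)), key=lambda l: hamming(M[l], h_x))


def min_distance(M: List[Code]) -> int:
    """Minimal Hamming distance d between two distinct code words."""
    return min(hamming(u, v) for u, v in combinations(M, 2))


def correctable_errors(M: List[Code]) -> int:
    """Number of binary errors that decoding is guaranteed to correct."""
    return (min_distance(M) - 1) // 2


# Hypothetical code matrix: 4 classes, code words of length 6.
M = [
    [-1, -1, -1, -1, -1, -1],
    [+1, +1, +1, -1, -1, -1],
    [-1, -1, +1, +1, +1, +1],
    [+1, +1, -1, +1, +1, +1],
]

# Flip each single entry of each code word and check that decoding recovers the class.
single_flip_ok = all(
    decode(M, row[:j] + [-row[j]] + row[j + 1:]) == l
    for l, row in enumerate(M)
    for j in range(len(row))
)
```

Here $d = 3$, so any single binary error is corrected, as the check on `single_flip_ok` confirms for this matrix.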

Figure 8.5 Illustration of error-correction codes for multi-class classification. Left:
binary code matrix $M$, with each row representing the code word of length $c = 6$
of a class $l \in [1, 8]$. Right: vector of predictions $\mathbf{h}(x)$ for a test point $x$. The ECOC
classifier assigns label 3 to $x$, since the binary code for the third class yields the
minimal Hamming distance with $\mathbf{h}(x)$ (distance of 1).

The ECOC technique just described can be extended in two ways. First, instead
of using only the label predicted by each classifier $h_j$, the magnitude of the scores
defining $h_j$ is used. Thus, if $h_j = \operatorname{sgn}(f_j)$ for some function $f_j$ whose values can
be interpreted as confidence scores, then the multi-class hypothesis $h\colon \mathcal{X} \to \mathcal{Y}$ is
defined by

$$\forall x \in \mathcal{X}, \quad h(x) = \operatorname*{argmin}_{l \in \mathcal{Y}} \sum_{j=1}^{c} L\big(m_{lj} f_j(x)\big), \tag{8.19}$$

where $(m_{lj})$ are the entries of $M$ and where $L\colon \mathbb{R} \to \mathbb{R}_+$ is a loss function. When $L$
is defined by $L(x) = \frac{1 - \operatorname{sgn}(x)}{2}$ for all $x \in \mathbb{R}$ and $h_j = f_j$, we can write:

$$\sum_{j=1}^{c} L\big(m_{lj} f_j(x)\big) = \sum_{j=1}^{c} \frac{1 - \operatorname{sgn}\big(m_{lj} h_j(x)\big)}{2} = d_H\big(M_l, \mathbf{h}(x)\big),$$
and (8.19) coincides with (8.18). Furthermore, ternary codes can be used with
matrix entries in $\{-1, 0, +1\}$ so that examples in classes labeled with 0 are disregarded
when training a binary classifier for each column. With these extensions, both OVA
and OVO become special instances of the ECOC technique. The matrix $M$ for
OVA is a square matrix, that is $c = k$, with all terms equal to $-1$ except for the
diagonal ones which are all equal to $+1$. The matrix $M$ for OVO has $c = k(k - 1)/2$
columns. Each column corresponds to a pair of distinct classes $(l, l')$, $l \neq l'$, with
all entries equal to 0 except for the one with row $l$, which is $-1$, and the one with
row $l'$, which is $+1$.

Since the values of the scoring functions are assumed to be confidence scores,
$m_{lj} f_j(x)$ can be interpreted as the margin of classifier $j$ on point $x$ and (8.19) is
thus based on some loss $L$ defined with respect to the binary classifier's margin.
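The OVA and OVO code matrices and the loss-based decoding rule (8.19) can be sketched as follows. The function names are ours, classes are numbered from 0, and the exponential loss used in the OVA check is one possible choice made for this illustration, not one prescribed by the text.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence

Matrix = List[List[int]]


def ova_matrix(k: int) -> Matrix:
    """k x k matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if j == l else -1 for j in range(k)] for l in range(k)]


def ovo_matrix(k: int) -> Matrix:
    """k x k(k-1)/2 ternary matrix: column (a, b) has -1 in row a, +1 in row b."""
    pairs = list(combinations(range(k), 2))
    return [[-1 if l == a else (1 if l == b else 0) for a, b in pairs] for l in range(k)]


def zero_one_loss(u: float) -> float:
    """L(u) = (1 - sgn(u)) / 2, with sgn(0) = 0."""
    sign = (u > 0) - (u < 0)
    return (1 - sign) / 2


def decode(M: Matrix, scores: Sequence[float], loss: Callable[[float], float]) -> int:
    """Formula (8.19): row of M minimizing the summed loss of the margins m_lj * f_j(x)."""
    return min(range(len(M)), key=lambda l: sum(loss(m * f) for m, f in zip(M[l], scores)))


def exp_loss(u: float) -> float:
    """Assumed margin loss for the OVA check below."""
    return math.exp(-u)


# OVA with the exponential loss picks the class with the largest score, as in (8.16).
ova_choice = decode(ova_matrix(3), [0.2, 0.1, 0.9], exp_loss)

# OVO with the zero-one loss: pair order is (0,1), (0,2), (1,2);
# +1 means the second class of the pair wins. Class 1 wins both of its matches.
ovo_choice = decode(ovo_matrix(3), [1.0, -1.0, -1.0], zero_one_loss)
```

In the OVO matrix every row has the same number of zero entries, so the constant loss $L(0)$ they contribute does not affect the argmin, and decoding reduces to the majority vote of (8.17).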

A further extension of ECOC consists of extending discrete codes to continuous
ones by letting the matrix entries take arbitrary real values and by using the training
sample to learn matrix $M$. Starting with a discrete version of $M$, $c$ binary classifiers
with scoring functions $f_l$, $l \in [1, c]$, are first learned as described previously. We will
denote by $\mathbf{F}(x)$ the vector $(f_1(x), \ldots, f_c(x))^\top$ for any $x \in \mathcal{X}$. Next, the entries of
$M$ are relaxed to take real values and learned from the training sample with the
objective of making the row of $M$ corresponding to the class of any point $x \in \mathcal{X}$
more similar to $\mathbf{F}(x)$ than other rows. The similarity can be measured using any
PDS kernel $K$. An example of an algorithm for learning $M$ using a PDS kernel
and the idea just discussed is in fact multi-class SVMs, which, in this context, can
be formulated as follows:
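$$\min_{M, \xi} \; \|M\|_F^2 + C \sum_{i=1}^{m} \xi_i$$
subject to: $\forall (i, l) \in [1, m] \times \mathcal{Y}$,
$$K\big(\mathbf{F}(x_i), M_{y_i}\big) \geq K\big(\mathbf{F}(x_i), M_l\big) + 1 - \xi_i.$$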


Similar algorithms can be defined using other matrix norms. The resulting multi- 
class classification decision function has the following form: 


$$h\colon x \mapsto \operatorname*{argmax}_{l \in \{1, \ldots, k\}} K\big(\mathbf{F}(x), M_l\big).$$


8.5 Structured prediction algorithms 


In this section, we briefly discuss an important class of problems related to multi- 
class classification that frequently arises in computer vision, computational biology, 
and natural language processing. These include all sequence labeling problems and 
complex problems such as parsing, machine translation, and speech recognition. 

In these applications, the output labels have a rich internal structure. For exam- 
ple, in part-of-speech tagging the problem consists of assigning a part-of-speech tag 
such as N (noun), V (verb), or A (adjective), to every word of a sentence. Thus, the 
label of the sentence $w_1 \ldots w_n$ made of the words $w_i$ is a sequence of part-of-speech
tags $t_1 \ldots t_n$. This can be viewed as a multi-class classification problem where each
sequence of tags is a possible label. However, several critical aspects common to 
such structured output problems make them distinct from the standard multi-class 
classification. 

First, the label set is exponentially large as a function of the size of the output.
For example, if $\Sigma$ denotes the alphabet of part-of-speech tags, for a sentence of
length $n$ there are $|\Sigma|^n$ possible tag sequences. Second, there are dependencies
between the substructures of a label that are important to take into account for
an accurate prediction. For example, in part-of-speech tagging, some tag sequences
may be ungrammatical or unlikely. Finally, the loss function used is typically not a
zero-one loss but one that depends on the substructures. Let $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ denote
a loss function such that $L(y', y)$ measures the penalty of predicting the label $y' \in \mathcal{Y}$
instead of the correct label $y \in \mathcal{Y}$.³ In part-of-speech tagging, $L(y', y)$ could be for
example the Hamming distance between $y'$ and $y$.

The relevant features in structured output problems often depend on both the
input and the output. Thus, we will denote by $\Phi(x, y) \in \mathbb{R}^N$ the feature vector
associated to a pair $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

To model the label structures and their dependency, the label set $\mathcal{Y}$ is typically
assumed to be endowed with a graphical model structure, that is, a graph giving a
probabilistic model of the conditional dependence between the substructures. It is
also assumed that both the feature vector $\Phi(x, y)$ associated to an input $x \in \mathcal{X}$ and
output $y \in \mathcal{Y}$ and the loss $L(y', y)$ factorize according to the cliques⁴ of that graphical
model. A detailed treatment of this topic would require a further background in
graphical models, and is thus beyond the scope of this section.

The hypothesis set used by most structured prediction algorithms is then defined
as the set of functions $h\colon \mathcal{X} \to \mathcal{Y}$ such that

$$\forall x \in \mathcal{X}, \quad h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathbf{w} \cdot \Phi(x, y), \tag{8.20}$$

for some vector $\mathbf{w} \in \mathbb{R}^N$. Let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ be an i.i.d.
labeled sample. Since the hypothesis set is linear, we can seek to define an algorithm
similar to multi-class SVMs. The optimization problem for multi-class SVMs can
be rewritten equivalently as follows:

$$\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \max_{y \neq y_i} \max\Big(0, 1 - \mathbf{w} \cdot \big[\Phi(x_i, y_i) - \Phi(x_i, y)\big]\Big). \tag{8.21}$$


However, here we need to take into account the loss function $L$, that is $L(y, y_i)$ for
each $i \in [1, m]$ and $y \in \mathcal{Y}$, and there are multiple ways to proceed. One possible way
is to let the margin violation be penalized additively with $L(y, y_i)$. Thus, in that
case $L(y, y_i)$ is added to the margin violation. Another natural method consists of
penalizing the margin violation by multiplying it with $L(y, y_i)$. A margin violation
with a larger loss is then penalized more than one with a smaller one.


3. More generally, in some applications, the loss function could also depend on the input.
Thus, $L$ is then a function mapping $L\colon \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, with $L(x, y', y)$ measuring the
penalty of predicting the label $y'$ instead of $y$ given the input $x$.

4. In an undirected graph, a clique is a set of fully connected vertices.

The additive penalization leads to the following algorithm known as Maximum
Margin Markov Networks (M³N):

$$\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \max_{y \neq y_i} \max\Big(0, L(y_i, y) - \mathbf{w} \cdot \big[\Phi(x_i, y_i) - \Phi(x_i, y)\big]\Big). \tag{8.22}$$


An advantage of this algorithm is that, as in the case of SVMs, it admits a natural
use of PDS kernels. As already indicated, the label set $\mathcal{Y}$ is assumed to be endowed
with a graph structure with a Markov property, typically a chain or a tree, and
the loss function is assumed to be decomposable in the same way. Under these
assumptions, by exploiting the graphical model structure of the labels, a polynomial-
time algorithm can be given to determine its solution.

A multiplicative combination of the loss with the margin leads to the following 
algorithm known as SVMStruct: 


$$\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \max_{y \neq y_i} L(y_i, y) \max\Big(0, 1 - \mathbf{w} \cdot \big[\Phi(x_i, y_i) - \Phi(x_i, y)\big]\Big). \tag{8.23}$$


This problem can be equivalently written as a QP with an infinite number of
constraints. In practice, it is solved iteratively by augmenting at each round the
finite set of constraints of the previous round with the most violating constraint.
This method can be applied in fact under very general assumptions and for arbitrary
loss definitions. As in the case of M³N, SVMStruct naturally admits the use of PDS
kernels and thus an extension to non-linear models for the solution.
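To make the difference between (8.21), (8.22) and (8.23) concrete, the sketch below evaluates the per-example term of each objective on a tiny tagging problem whose label set is small enough to enumerate. Everything in it is illustrative: the scores stand in for the values $\mathbf{w} \cdot \Phi(x_i, y)$ and are invented, the loss is the Hamming distance, and the function names are ours. Practical algorithms avoid this enumeration by exploiting the graphical model structure.

```python
from itertools import product
from typing import Dict, List, Tuple

Label = Tuple[str, ...]


def hamming(y1: Label, y2: Label) -> int:
    """Number of positions at which two tag sequences differ."""
    return sum(a != b for a, b in zip(y1, y2))


def multiclass_svm_term(score: Dict[Label, float], y_true: Label, labels: List[Label]) -> float:
    """Per-example term of (8.21): largest hinge violation with unit margin."""
    return max(max(0.0, 1.0 - (score[y_true] - score[y])) for y in labels if y != y_true)


def m3n_term(score: Dict[Label, float], y_true: Label, labels: List[Label]) -> float:
    """Per-example term of (8.22): the loss is added to the margin violation."""
    return max(
        max(0.0, hamming(y_true, y) - (score[y_true] - score[y]))
        for y in labels if y != y_true
    )


def svmstruct_term(score: Dict[Label, float], y_true: Label, labels: List[Label]) -> float:
    """Per-example term of (8.23): the loss multiplies the margin violation."""
    return max(
        hamming(y_true, y) * max(0.0, 1.0 - (score[y_true] - score[y]))
        for y in labels if y != y_true
    )


# Two-word sentence, tags N and V: four candidate tag sequences.
labels = list(product("NV", repeat=2))
y_true = ("N", "V")
# Invented values of w . Phi(x, y) for each candidate y.
score = {("N", "V"): 2.0, ("N", "N"): 1.5, ("V", "V"): 0.5, ("V", "N"): 1.5}

terms = (
    multiclass_svm_term(score, y_true, labels),
    m3n_term(score, y_true, labels),
    svmstruct_term(score, y_true, labels),
)
```

With these scores the three terms are 0.5, 1.5 and 1.0: the sequence `("V", "N")`, which is wrong at both positions, is penalized more by the two loss-sensitive objectives than by (8.21).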

Another standard algorithm for structured prediction problems is Conditional
Random Fields (CRFs). We will not describe this algorithm in detail, but point
out its similarity with the algorithms just described, in particular M³N. The
optimization problem for CRFs can be written as

$$\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} \exp\Big(L(y_i, y) - \mathbf{w} \cdot \big[\Phi(x_i, y_i) - \Phi(x_i, y)\big]\Big). \tag{8.24}$$

Assume for simplicity that $\mathcal{Y}$ is finite and has cardinality $k$ and let $f$ denote the
function $(x_1, \ldots, x_k) \mapsto \log\big(\sum_{j=1}^{k} e^{x_j}\big)$. $f$ is a convex function known as the soft-
max, since it provides a smooth approximation of $(x_1, \ldots, x_k) \mapsto \max(x_1, \ldots, x_k)$.
Then, problem (8.24) is similar to (8.22) modulo the replacement of the max
operator with the soft-max function just described.


8.6 Chapter notes


The margin-based generalization bound for multi-class classification presented in
theorem 8.1 is based on an adaptation of the result and proof due to Koltchinskii and
Panchenko [2002]. Proposition 8.1 bounding the Rademacher complexity of multi-
class kernel-based hypotheses and corollary 8.1 are new.

An algorithm generalizing SVMs to the multi-class classification setting was first
introduced by Weston and Watkins [1999]. The optimization problem for that
algorithm was based on $k(k - 1)/2$ slack variables for a problem with $k$ classes and
thus could be inefficient for a relatively large number of classes. A simplification of
that algorithm, obtained by replacing the sum of the slack variables $\sum_{j \neq i} \xi_{ij}$ related to point
$x_i$ by its maximum $\xi_i = \max_{j \neq i} \xi_{ij}$, considerably reduces the number of variables
and leads to the multi-class SVM algorithm presented in this chapter [Crammer
and Singer, 2001, 2002].

The AdaBoost.MH algorithm is presented and discussed by Schapire and Singer
[1999, 2000]. As we showed in this chapter, the algorithm is a special instance
of AdaBoost. Another boosting-type algorithm for multi-class classification,
AdaBoost.MR, is presented by Schapire and Singer [1999, 2000]. That algorithm is
also a special instance of the RankBoost algorithm presented in chapter 9. See
exercise 9.5 for a detailed analysis of this algorithm, including generalization bounds.

The most commonly used tools for learning decision trees are CART (classification 
and regression tree) [Breiman et al., 1984] and C4.5 [Quinlan, 1986, 1993]. The 
greedy technique we described for learning decision trees benefits in fact from an 
interesting analysis: remarkably, it has been shown by Kearns and Mansour [1999], 
Mansour and McAllester [1999] that, under a weak learner hypothesis assumption, 
such decision tree algorithms produce a strong hypothesis. The grow-then-prune 
method is from CART. It has been analyzed by a variety of different studies, in 
particular by Kearns and Mansour [1998] and Mansour and McAllester [2000], who 
give generalization bounds for the resulting decision trees with respect to the error 
and size of the best sub-tree of the original tree pruned. 

The idea of the ECOC framework for multi-class classification is due to Dietterich
and Bakiri [1995]. Allwein et al. [2000] further extended and analyzed this method
to margin-based losses, for which they presented a bound on the empirical error
and a generalization bound in the more specific case of boosting. While the OVA
technique is in general subject to a calibration issue and does not have any
justification, it is very commonly used in practice. Rifkin [2002] reports the results
of extensive experiments with several multi-class classification algorithms that are
rather favorable to the OVA technique, with performance often very close to or better
than that of several uncombined algorithms, unlike what has been claimed by
some authors (see also Rifkin and Klautau [2004]).

The CRFs algorithm was introduced by Lafferty, McCallum, and Pereira [2001].
M³N is due to Taskar, Guestrin, and Koller [2003] and StructSVM was presented by
Tsochantaridis, Joachims, Hofmann, and Altun [2005]. An alternative technique for
tackling structured prediction as a regression problem was presented and analyzed
by Cortes, Mohri, and Weston [2007c].


8.7 Exercises 


8.1 Generalization bounds for multi-label case. Use similar techniques to those used 
in the proof of theorem 8.1 to derive a margin-based learning bound in the multi- 
label case. 


8.2 Multi-class classification with kernel-based hypotheses constrained by an $L_p$
norm. Use corollary 8.1 to define alternative multi-class classification algorithms
with kernel-based hypotheses constrained by an $L_p$ norm with $p \neq 2$. For which
value of $p \geq 1$ is the bound of proposition 8.1 tightest? Derive the dual optimization
of the multi-class classification algorithm defined with $p = \infty$.


8.3 Alternative multi-class boosting algorithm. Consider the objective function
$G$ defined for any sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ and $\boldsymbol{\alpha} =
(\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, $n \geq 1$, by

$$G(\boldsymbol{\alpha}) = \sum_{i=1}^{m} e^{-\frac{1}{k} \sum_{l=1}^{k} y_i[l] \, g_n(x_i, l)} = \sum_{i=1}^{m} e^{-\frac{1}{k} \sum_{l=1}^{k} y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}. \tag{8.25}$$

Use the convexity of the exponential function to compare $G$ with the objective
function $F$ defining AdaBoost.MH. Show that $G$ is a convex function upper bounding
the multi-label multi-class error. Discuss the properties of $G$ and derive an algorithm
defined by the application of coordinate descent to $G$. Give theoretical guarantees
for the performance of the algorithm and analyze its running-time complexity when
using boosting stumps.


8.4 Multi-class algorithm based on RankBoost. This problem requires familiarity
with the material presented both in this chapter and in chapter 9. An alternative
boosting-type multi-class classification algorithm is one based on a ranking criterion.
We will define and examine that algorithm in the mono-label setting. Let $H$ be a
family of base hypotheses mapping $\mathcal{X} \times \mathcal{Y}$ to $\{-1, +1\}$. Let $F$ be the following
objective function defined for all samples $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$
and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, $n \geq 1$, by

$$F(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \sum_{l \neq y_i} e^{-(g_n(x_i, y_i) - g_n(x_i, l))} = \sum_{i=1}^{m} \sum_{l \neq y_i} e^{-\sum_{t=1}^{n} \alpha_t (h_t(x_i, y_i) - h_t(x_i, l))}, \tag{8.26}$$

where $g_n = \sum_{t=1}^{n} \alpha_t h_t$.


(a) Show that $F$ is convex and differentiable.

(b) Show that $\frac{1}{m} \sum_{i=1}^{m} 1_{\rho_{g_n}(x_i, y_i) \leq 0} \leq \frac{1}{m} F(\boldsymbol{\alpha})$, where $g_n = \sum_{t=1}^{n} \alpha_t h_t$.

(c) Give the pseudocode of the algorithm obtained by applying coordinate
descent to $F$. The resulting algorithm is known as AdaBoost.MR. Show that
AdaBoost.MR exactly coincides with the RankBoost algorithm applied to the
problem of ranking pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Describe exactly the ranking target
for these pairs.


(d) Use question (8.4b) and the learning bounds of this chapter to derive
margin-based generalization bounds for this algorithm.

(e) Use the connection of the algorithm with RankBoost and the learning
bounds of chapter 9 to derive alternative generalization bounds for this
algorithm. Compare these bounds with those of the previous question.


8.5 Decision trees. Show that the VC-dimension of a binary decision tree with $n$ nodes
in dimension $N$ is in $O(n \log N)$.


8.6 Give an example where the generalization error of each of the $k(k - 1)/2$ binary
classifiers $h_{ll'}$, $l \neq l'$, used in the definition of the OVO technique is $r$ and that of
the OVO hypothesis is $(k - 1)r$.


9 Ranking 


The learning problem of ranking arises in many modern applications, including
the design of search engines, information extraction platforms, and movie
recommendation systems. In these applications, the ordering of the documents or movies
returned is a critical aspect of the system. The main motivation for ranking over
classification in the binary case is the limitation of resources: for very large data
sets, it may be impractical or even impossible to display or process all items labeled
as relevant by a classifier. A standard user of a search engine is not willing to
consult all the documents returned in response to a query, but only the top ten or so.
Similarly, a member of the fraud detection department of a credit card company
cannot investigate thousands of transactions classified as potentially fraudulent, but
only a few dozen of the most suspicious ones.

In this chapter, we study in depth the learning problem of ranking. We distinguish
two general settings for this problem: the score-based and the preference-based
settings. For the score-based setting, which is the most widely explored one, we present
margin-based generalization bounds using the notion of Rademacher complexity.
We then describe an SVM-based ranking algorithm that can be derived from these
bounds and describe and analyze RankBoost, a boosting algorithm for ranking.
We further study specifically the bipartite setting of the ranking problem where,
as in binary classification, each point belongs to one of two classes. We discuss an
efficient implementation of RankBoost in that setting and point out its
connections with AdaBoost. We also introduce the notions of ROC curves and area under
the ROC curves (AUC) which are directly relevant to bipartite ranking. For the
preference-based setting, we present a series of results, in particular regret-based
guarantees for both a deterministic and a randomized algorithm, as well as a lower
bound in the deterministic case.


9.1 The problem of ranking 


We first introduce the most commonly studied scenario of the ranking problem in
machine learning. We will refer to this scenario as the score-based setting of the
ranking problem. In section 9.6, we present and analyze an alternative setting, the
preference-based setting.

The general supervised learning problem of ranking consists of using labeled 
information to define an accurate ranking prediction function for all points. In the 
scenario examined here, the labeled information is supplied only for pairs of points 
and the quality of a predictor is similarly measured in terms of its average pairwise 
misranking. The predictor is a real-valued function, a scoring function: the scores 
assigned to input points by this function determine their ranking. 

Let $\mathcal{X}$ denote the input space. We denote by $D$ an unknown distribution over
$\mathcal{X} \times \mathcal{X}$ according to which pairs of points are drawn and by $f\colon \mathcal{X} \times \mathcal{X} \to \{-1, 0, +1\}$
a target labeling function or preference function. The three values assigned by $f$
are interpreted as follows: $f(x, x') = +1$ if $x'$ is preferred to $x$ or ranked higher
than $x$, $f(x, x') = -1$ if $x$ is preferred to $x'$, and $f(x, x') = 0$ if both $x$ and $x'$ have
the same preference or ranking, or if there is no information about their respective
ranking. This formulation corresponds to a deterministic scenario which we adopt
for simplification. As discussed in section 2.4.1, it can be straightforwardly extended
to a stochastic scenario where we have a distribution over $\mathcal{X} \times \mathcal{X} \times \{-1, 0, +1\}$.

Note that in general no particular assumption is made about the transitivity of the
order induced by $f$: we may have $f(x, x') = 1$ and $f(x', x'') = 1$ but $f(x, x'') = -1$
for three points $x$, $x'$, and $x''$. While this may contradict an intuitive notion of
preference, such preference orders are in fact commonly encountered in practice, in
particular when they are based on human judgments. This is sometimes because the
preference between two items is decided based on different features: for example,
an individual may prefer movie $x'$ to $x$ because $x'$ is an action movie and $x$ a
musical, and prefer $x''$ to $x'$ because $x''$ is an action movie with more active scenes
than $x'$. Nevertheless, he may prefer $x$ to $x''$ because the cost of renting a DVD
for $x''$ is prohibitive. Thus, in this example, two features, the genre and the price,
are invoked, each affecting the decision for different pairs. In fact, in general, no
assumption is made about the preference function, not even the antisymmetry of
the order induced; thus, we may have $f(x, x') = 1$ and $f(x', x) = 1$ and yet $x \neq x'$.

The learner receives a labeled sample $S = ((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m)) \in
(\mathcal{X} \times \mathcal{X} \times \{-1, 0, +1\})^m$ with $(x_1, x'_1), \ldots, (x_m, x'_m)$ drawn i.i.d. according to $D$ and
$y_i = f(x_i, x'_i)$ for all $i \in [1, m]$. Given a hypothesis set $H$ of functions mapping $\mathcal{X}$ to
$\mathbb{R}$, the ranking problem consists of selecting a hypothesis $h \in H$ with small expected
pairwise misranking or generalization error $R(h)$ with respect to the target $f$:

$$R(h) = \Pr_{(x, x') \sim D}\Big[\big(f(x, x') \neq 0\big) \wedge \big(f(x, x')(h(x') - h(x)) \leq 0\big)\Big]. \tag{9.1}$$

The empirical pairwise misranking or empirical error of $h$ is denoted by $\widehat{R}(h)$ and
defined by

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{(y_i \neq 0) \wedge (y_i (h(x'_i) - h(x_i)) \leq 0)}. \tag{9.2}$$


Note that while the target preference function $f$ is in general not transitive, the
linear ordering induced by a scoring function $h \in H$ is by definition transitive. This
is a drawback of the score-based setting for the ranking problem since, regardless of
the complexity of the hypothesis set $H$, if the preference function is not transitive,
no hypothesis $h \in H$ can faultlessly predict the target pairwise ranking.
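The empirical pairwise misranking (9.2) and the effect of a non-transitive target can be illustrated with a short sketch. The three items and their cyclic preferences are invented, and the search is restricted to integer scores in $\{0, 1, 2\}$, which is enough to realize every ordering of three items, ties included.

```python
from itertools import product
from typing import Callable, Hashable, List, Tuple

Triple = Tuple[Hashable, Hashable, int]  # (x, x', y) with y in {-1, 0, +1}


def empirical_misranking(h: Callable[[Hashable], float], sample: List[Triple]) -> float:
    """Formula (9.2): fraction of labeled pairs with y != 0 that h fails to rank strictly."""
    errors = sum(1 for x, xp, y in sample if y != 0 and y * (h(xp) - h(x)) <= 0)
    return errors / len(sample)


def best_error(sample: List[Triple], items: str) -> float:
    """Smallest empirical error over all scorings of the items with values in {0, 1, 2}."""
    best = 1.0
    for scores in product(range(3), repeat=len(items)):
        table = dict(zip(items, scores))
        best = min(best, empirical_misranking(table.get, sample))
    return best


# Cyclic preferences: b over a, c over b, a over c.
cycle = [("a", "b", 1), ("b", "c", 1), ("c", "a", 1)]
# Transitive preferences: b over a, c over b, c over a.
chain = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1)]

cycle_error = best_error(cycle, "abc")
chain_error = best_error(chain, "abc")
```

No scoring function ranks the cyclic sample without error (the best achievable empirical error is 1/3), whereas the transitive sample admits a perfect scoring.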


9.2 Generalization bound 


In this section, we present margin-based generalization bounds for ranking. To
simplify the presentation, we will assume for the results of this section that the
pairwise labels are in $\{-1, +1\}$. Thus, if a pair $(x, x')$ is drawn according to $D$, then
either $x$ is preferred to $x'$ or the opposite. The learning bounds for the general case
have a very similar form but require more details. As in the case of classification,
for any $\rho > 0$, we can define the empirical margin loss of a hypothesis $h$ for pairwise
ranking as

$$\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho\big(y_i (h(x'_i) - h(x_i))\big), \tag{9.3}$$

where $\Phi_\rho$ is the margin loss function (definition 4.3). Thus, the empirical margin
loss for ranking is upper bounded by the fraction of the pairs $(x_i, x'_i)$ that $h$ is
misranking or correctly ranking but with confidence less than $\rho$:

$$\widehat{R}_\rho(h) \leq \frac{1}{m} \sum_{i=1}^{m} 1_{y_i (h(x'_i) - h(x_i)) \leq \rho}. \tag{9.4}$$

We denote by $D_1$ the marginal distribution of the first element of the pairs in $\mathcal{X} \times \mathcal{X}$
derived from $D$, and by $D_2$ the marginal distribution with respect to the second
element of the pairs. Similarly, $S_1$ is the sample derived from $S$ by keeping only the
first element of each pair: $S_1 = ((x_1, y_1), \ldots, (x_m, y_m))$ and $S_2$ the one obtained by
keeping only the second element: $S_2 = ((x'_1, y_1), \ldots, (x'_m, y_m))$. We also denote by
$\mathfrak{R}_m^{D_1}(H)$ the Rademacher complexity of $H$ with respect to the marginal distribution
$D_1$, that is $\mathfrak{R}_m^{D_1}(H) = \operatorname{E}[\widehat{\mathfrak{R}}_{S_1}(H)]$, and similarly $\mathfrak{R}_m^{D_2}(H) = \operatorname{E}[\widehat{\mathfrak{R}}_{S_2}(H)]$. Clearly, if
the distribution $D$ is symmetric, the marginal distributions $D_1$ and $D_2$ coincide and
$\mathfrak{R}_m^{D_1}(H) = \mathfrak{R}_m^{D_2}(H)$.


Theorem 9.1 Margin bound for ranking

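Let $H$ be a set of real-valued functions. Fix $\rho > 0$; then, for any $\delta > 0$, with
probability at least $1 - \delta$ over the choice of a sample $S$ of size $m$, each of the
following holds for all $h \in H$:

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \Big(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\Big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \tag{9.5}$$

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \Big(\widehat{\mathfrak{R}}_{S_1}(H) + \widehat{\mathfrak{R}}_{S_2}(H)\Big) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \tag{9.6}$$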

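Proof The proof is similar to that of theorem 4.4. Let $\widetilde{H}$ be the family of
hypotheses mapping $(\mathcal{X} \times \mathcal{X}) \times \{-1, +1\}$ to $\mathbb{R}$ defined by $\widetilde{H} = \{z = ((x, x'), y) \mapsto
y[h(x') - h(x)] \colon h \in H\}$. Consider the family of functions $\widetilde{\mathcal{H}} = \{\Phi_\rho \circ f \colon f \in \widetilde{H}\}$
derived from $\widetilde{H}$ which are taking values in $[0, 1]$. By theorem 3.1, for any $\delta > 0$ with
probability at least $1 - \delta$, for all $h \in H$,

$$\operatorname{E}\big[\Phi_\rho(y[h(x') - h(x)])\big] \leq \widehat{R}_\rho(h) + 2 \mathfrak{R}_m\big(\Phi_\rho \circ \widetilde{H}\big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Since $1_{u \leq 0} \leq \Phi_\rho(u)$ for all $u \in \mathbb{R}$, the generalization error $R(h)$ is a lower bound
on the left-hand side, $R(h) = \operatorname{E}[1_{y[h(x') - h(x)] \leq 0}] \leq \operatorname{E}[\Phi_\rho(y[h(x') - h(x)])]$, and we can
write:

$$R(h) \leq \widehat{R}_\rho(h) + 2 \mathfrak{R}_m\big(\Phi_\rho \circ \widetilde{H}\big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Exactly as in the proof of theorem 4.4, we can show that $\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \leq \frac{1}{\rho} \mathfrak{R}_m(\widetilde{H})$
using the $(1/\rho)$-Lipschitzness of $\Phi_\rho$. Here, $\mathfrak{R}_m(\widetilde{H})$ can be upper bounded as follows:

$$\begin{aligned}
\mathfrak{R}_m(\widetilde{H}) &= \frac{1}{m} \operatorname*{E}_{S, \sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i y_i \big(h(x'_i) - h(x_i)\big)\Big] \\
&= \frac{1}{m} \operatorname*{E}_{S, \sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i \big(h(x'_i) - h(x_i)\big)\Big] && \text{($y_i \sigma_i$ and $\sigma_i$: same distrib.)} \\
&\leq \frac{1}{m} \operatorname*{E}_{S, \sigma}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x'_i) + \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] && \text{(by sub-additivity of sup)} \\
&= \operatorname*{E}_{S}\Big[\widehat{\mathfrak{R}}_{S_2}(H) + \widehat{\mathfrak{R}}_{S_1}(H)\Big] && \text{(definition of $S_1$ and $S_2$)} \\
&= \mathfrak{R}_m^{D_2}(H) + \mathfrak{R}_m^{D_1}(H),
\end{aligned}$$

which proves (9.5). The second inequality, (9.6), can be derived in the same way by
using the second inequality of theorem 3.1, (3.4), instead of (3.3). □

These bounds can be generalized to hold uniformly for all $\rho > 0$ at the cost of an
additional term $\sqrt{(\log \log_2(2/\rho))/m}$, as in theorem 4.5 and exercise 4.2. As for other
margin bounds presented in previous sections, they show the conflict between two
terms: the larger the desired pairwise ranking margin $\rho$, the smaller the middle term.
However, the first term, the empirical pairwise ranking margin loss $\widehat{R}_\rho$, increases as
a function of $\rho$.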

Known upper bounds for the Rademacher complexity of a hypothesis H, including 
bounds in terms of VC-dimension, can be used directly to make theorem 9.1 more 
explicit. In particular, using theorem 9.1, we obtain immediately the following 
margin bound for pairwise ranking using kernel-based hypotheses. 


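Corollary 9.1 Margin bounds for ranking with kernel-based hypotheses
Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel with $r = \sup_{x \in \mathcal{X}} K(x,x)$. Let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a feature mapping associated to $K$ and let $H = \{x \mapsto w \cdot \Phi(x) : \|w\|_{\mathbb{H}} \le \Lambda\}$ for some $\Lambda \ge 0$. Fix $\rho > 0$. Then, for any $\delta > 0$, the following pairwise margin bound holds with probability at least $1 - \delta$ for any $h \in H$:

$$R(h) \le \widehat{R}_\rho(h) + 4\sqrt{\frac{r^2\Lambda^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{9.7}$$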


As with theorem 4.4, the bound of this corollary can be generalized to hold uniformly for all $\rho > 0$ at the cost of an additional term $\sqrt{(\log\log_2(2/\rho))/m}$. This generalization bound for kernel-based hypotheses is remarkable, since it does not depend directly on the dimension of the feature space, but only on the pairwise ranking margin. It suggests that a small generalization error can be achieved when $\rho/r$ is large (small second term) while the empirical margin loss is relatively small (first term). The latter occurs when few pairs are either ranked incorrectly or ranked correctly but with margin less than $\rho$.


9.3. Ranking with SVMs 


In this section, we discuss an algorithm that is derived directly from the theoretical 
guarantees just presented. The algorithm turns out to be a special instance of the 
SVM algorithm. 

Proceeding as in section 4.4 for classification, the guarantee of corollary 9.1 can be expressed as follows: for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H = \{x \mapsto w \cdot \Phi(x) : \|w\| \le \Lambda\}$,

$$R(h) \le \frac{1}{m}\sum_{i=1}^m \xi_i + 4\sqrt{\frac{r^2\Lambda^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}, \tag{9.8}$$

where $\xi_i = \max\big(1 - y_i[w \cdot (\Phi(x'_i) - \Phi(x_i))], 0\big)$ for all $i \in [1,m]$, and where $\Phi\colon \mathcal{X} \to \mathbb{H}$ is a feature mapping associated to a PDS kernel $K$. An algorithm based on this theoretical guarantee consists of minimizing the right-hand side of (9.8), that is, minimizing an objective function with one term corresponding to the sum of the slack variables $\xi_i$ and another one minimizing $\|w\|$ or equivalently $\|w\|^2$. Its optimization problem can thus be formulated as

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i \tag{9.9}$$

$$\text{subject to: } y_i\big[w \cdot (\Phi(x'_i) - \Phi(x_i))\big] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in [1,m].$$


This coincides exactly with the primal optimization problem of SVMs, with a feature mapping $\Psi\colon \mathcal{X} \times \mathcal{X} \to \mathbb{H}$ defined by $\Psi(x,x') = \Phi(x') - \Phi(x)$ for all $(x,x') \in \mathcal{X} \times \mathcal{X}$, and with a hypothesis set of functions of the form $(x,x') \mapsto w \cdot \Psi(x,x')$. Thus, clearly, all the properties already presented for SVMs apply in this instance. In particular, the algorithm can benefit from the use of PDS kernels. Problem (9.9) admits an equivalent dual that can be expressed in terms of the kernel matrix $K'$ defined by

$$K'_{ij} = \Psi(x_i,x'_i) \cdot \Psi(x_j,x'_j) = K(x_i,x_j) + K(x'_i,x'_j) - K(x'_i,x_j) - K(x_i,x'_j), \tag{9.10}$$

for all $i,j \in [1,m]$. This algorithm can provide an effective solution for pairwise ranking in practice. The algorithm can also be used and extended to the case where the labels are in $\{-1,0,+1\}$. The next section presents an alternative algorithm for ranking in the score-based setting.
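
As an illustration of (9.10), the pairwise kernel matrix can be assembled from four blocks of the base kernel. The following is a minimal sketch in Python with NumPy, not a reference implementation: the function names `pairwise_kernel_matrix` and `linear_kernel`, the convention that a kernel takes two arrays of points and returns their Gram block, and the random data are all assumptions made here for illustration.

```python
import numpy as np

def pairwise_kernel_matrix(K, X, X_prime):
    """K'[i, j] = K(x_i, x_j) + K(x'_i, x'_j) - K(x'_i, x_j) - K(x_i, x'_j).

    K is assumed to map two arrays of points to the block of kernel values
    between them; row i of X and X_prime holds x_i and x'_i respectively.
    """
    K_xx = K(X, X)
    K_pp = K(X_prime, X_prime)
    K_px = K(X_prime, X)    # entry (i, j) is K(x'_i, x_j)
    K_xp = K(X, X_prime)    # entry (i, j) is K(x_i, x'_j)
    return K_xx + K_pp - K_px - K_xp

def linear_kernel(A, B):
    """Base kernel used for the sanity check below (Phi is the identity)."""
    return A @ B.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # hypothetical first elements x_i
X_prime = rng.normal(size=(5, 3))    # hypothetical second elements x'_i

K_pairs = pairwise_kernel_matrix(linear_kernel, X, X_prime)

# For the linear kernel, Psi(x, x') = x' - x can be formed explicitly.
Psi = X_prime - X
explicit = Psi @ Psi.T
```

For the linear kernel the two matrices `K_pairs` and `explicit` agree, which is just the expansion of the inner product in (9.10). The resulting matrix can then be passed to any SVM solver that accepts a precomputed kernel.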


9.4 RankBoost 


This section presents a boosting algorithm for pairwise ranking, RankBoost, similar 
to the AdaBoost algorithm for binary classification. RankBoost is based on ideas 
analogous to those discussed for classification: it consists of combining different 
base rankers to create a more accurate predictor. The base rankers are hypotheses 
returned by a weak learning algorithm for ranking. As for classification, these 
base hypotheses must satisfy a minimal accuracy condition that will be described 
precisely later. 

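Let $H$ denote the hypothesis set from which the base rankers are selected. Algorithm 9.1 gives the pseudocode of the RankBoost algorithm when $H$ is a set of functions mapping from $\mathcal{X}$ to $\{0,1\}$. For any $s \in \{-1,0,+1\}$, we define $\epsilon_t^s$ by

$$\epsilon_t^s = \sum_{i=1}^m D_t(i)\, 1_{y_i(h_t(x'_i) - h_t(x_i)) = s} = \mathop{\mathbb{E}}_{i \sim D_t}\big[1_{y_i(h_t(x'_i) - h_t(x_i)) = s}\big], \tag{9.11}$$

and simplify the notation $\epsilon_t^{+1}$ into $\epsilon_t^+$ and similarly write $\epsilon_t^-$ instead of $\epsilon_t^{-1}$. With these definitions, clearly the following equality holds: $\epsilon_t^0 + \epsilon_t^+ + \epsilon_t^- = 1$.

RANKBOOST($S = ((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m))$)
 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1(i) \leftarrow \frac{1}{m}$
 3  for $t \leftarrow 1$ to $T$ do
 4      $h_t \leftarrow$ base ranker in $H$ with smallest $\epsilon_t^- - \epsilon_t^+ = -\mathbb{E}_{i \sim D_t}\big[y_i(h_t(x'_i) - h_t(x_i))\big]$
 5      $\alpha_t \leftarrow \frac{1}{2}\log\frac{\epsilon_t^+}{\epsilon_t^-}$
 6      $Z_t \leftarrow \epsilon_t^0 + 2[\epsilon_t^+\epsilon_t^-]^{1/2}$    ▷ normalization factor
 7      for $i \leftarrow 1$ to $m$ do
 8          $D_{t+1}(i) \leftarrow \frac{D_t(i)\exp[-\alpha_t y_i(h_t(x'_i) - h_t(x_i))]}{Z_t}$
 9  $g \leftarrow \sum_{t=1}^T \alpha_t h_t$
10  return $g$

Figure 9.1 RankBoost algorithm for $H \subseteq \{0,1\}^{\mathcal{X}}$.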

The algorithm takes as input a labeled sample $S = ((x_1,x'_1,y_1), \ldots, (x_m,x'_m,y_m))$ with elements in $\mathcal{X} \times \mathcal{X} \times \{-1,0,+1\}$, and maintains a distribution over the subset of the indices $i \in \{1,\ldots,m\}$ for which $y_i \ne 0$. To simplify the presentation, we will assume that $y_i \ne 0$ for all $i \in \{1,\ldots,m\}$ and consider distributions defined over $\{1,\ldots,m\}$. This can be guaranteed by simply first removing from the sample the pairs labeled with zero.

Initially (lines 1-2), the distribution is uniform ($D_1$). At each round of boosting, that is at each iteration $t \in [1,T]$ of the loop 3-8, a new base ranker $h_t \in H$ is selected with the smallest difference $\epsilon_t^- - \epsilon_t^+$, that is, one with the smallest pairwise misranking error and largest correct pairwise ranking accuracy for the distribution $D_t$:

$$h_t \in \operatorname*{argmin}_{h \in H}\Big\{-\mathop{\mathbb{E}}_{i \sim D_t}\big[y_i(h(x'_i) - h(x_i))\big]\Big\}.$$

Note that $\epsilon_t^- - \epsilon_t^+ = \epsilon_t^- - (1 - \epsilon_t^- - \epsilon_t^0) = 2\epsilon_t^- + \epsilon_t^0 - 1$. Thus, finding the smallest difference $\epsilon_t^- - \epsilon_t^+$ is equivalent to seeking the smallest $2\epsilon_t^- + \epsilon_t^0$, which itself coincides with seeking the smallest $\epsilon_t^-$ when $\epsilon_t^0 = 0$. $Z_t$ is simply a normalization factor to ensure that the weights $D_{t+1}(i)$ sum to one. RankBoost relies on the assumption that at each round $t \in [1,T]$, for the hypothesis $h_t$ found, the inequality $\epsilon_t^+ - \epsilon_t^- > 0$ holds; thus, the probability mass of the pairs correctly ranked by $h_t$ (ignoring pairs with label zero) is larger than that of misranked pairs. We denote by $\gamma_t$ the edge of the base ranker $h_t$: $\gamma_t = \frac{\epsilon_t^+ - \epsilon_t^-}{2}$.

The precise reason for the definition of the coefficient $\alpha_t$ (line 5) will become clear later. For now, observe that if $\epsilon_t^+ - \epsilon_t^- > 0$, then $\epsilon_t^+/\epsilon_t^- > 1$ and $\alpha_t > 0$. Thus, the new distribution $D_{t+1}$ is defined from $D_t$ by increasing the weight on $i$ if the pair $(x_i,x'_i)$ is misranked ($y_i(h_t(x'_i) - h_t(x_i)) < 0$), and, on the contrary, decreasing it if $(x_i,x'_i)$ is ranked correctly ($y_i(h_t(x'_i) - h_t(x_i)) > 0$). The relative weight is unchanged for a pair with $h_t(x'_i) - h_t(x_i) = 0$. This distribution update has the effect of focusing more on misranked pairs at the next round of boosting.

After $T$ rounds of boosting, the hypothesis returned by RankBoost is $g$, which is a linear combination of the base rankers $h_t$. The weight $\alpha_t$ assigned to $h_t$ in that sum is a logarithmic function of the ratio of $\epsilon_t^+$ and $\epsilon_t^-$. Thus, more accurate base rankers are assigned a larger weight in that sum.

For any $t \in [1,T]$, we will denote by $g_t$ the linear combination of the base rankers after $t$ rounds of boosting: $g_t = \sum_{s=1}^t \alpha_s h_s$. In particular, we have $g_T = g$. The distribution $D_{t+1}$ can be expressed in terms of $g_t$ and the normalization factors $Z_s$, $s \in [1,t]$, as follows:

$$\forall i \in [1,m], \quad D_{t+1}(i) = \frac{e^{-y_i(g_t(x'_i) - g_t(x_i))}}{m\prod_{s=1}^t Z_s}. \tag{9.12}$$

We will make use of this identity several times in the proofs of the following sections. It can be shown straightforwardly by repeatedly expanding the definition of the distribution over the point $x_i$:

$$\begin{aligned}
D_{t+1}(i) &= \frac{D_t(i)\, e^{-\alpha_t y_i(h_t(x'_i) - h_t(x_i))}}{Z_t} \\
&= \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i(h_{t-1}(x'_i) - h_{t-1}(x_i))}\, e^{-\alpha_t y_i(h_t(x'_i) - h_t(x_i))}}{Z_{t-1}Z_t} \\
&= \frac{e^{-y_i\sum_{s=1}^t \alpha_s(h_s(x'_i) - h_s(x_i))}}{m\prod_{s=1}^t Z_s}.
\end{aligned}$$
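
The steps of the algorithm can be sketched in a few lines of Python with NumPy. This is only an illustration under assumptions of our own: the finite pool of base rankers is given by two hypothetical 0/1 matrices `H_x` and `H_xp` holding the values $h_k(x_i)$ and $h_k(x'_i)$, labels are in $\{-1,+1\}$, and the loop simply stops when the weak ranking assumption $\epsilon_t^+ > \epsilon_t^-$ fails or when $\epsilon_t^- = 0$ (where $\alpha_t$ would be infinite). The threshold rankers and the random data are likewise made up for the example.

```python
import numpy as np

def rankboost(H_x, H_xp, y, T):
    """RankBoost for base rankers with values in {0, 1}.

    H_x[k, i] = h_k(x_i), H_xp[k, i] = h_k(x'_i), y[i] in {-1, +1}.
    Returns indices of chosen rankers, their weights alpha_t, the Z_t, and D_{T+1}.
    """
    m = y.shape[0]
    D = np.full(m, 1.0 / m)                      # lines 1-2: uniform D_1
    diff = y * (H_xp - H_x)                      # y_i (h(x'_i) - h(x_i)) in {-1, 0, +1}
    pos = (diff == 1).astype(float)
    neg = (diff == -1).astype(float)
    chosen, alphas, Zs = [], [], []
    for _ in range(T):                           # line 3
        eps_plus, eps_minus = pos @ D, neg @ D   # eps_t^+ and eps_t^- for every candidate
        k = int(np.argmin(eps_minus - eps_plus)) # line 4
        ep, em = eps_plus[k], eps_minus[k]
        if ep <= em or em == 0.0:                # assumption of this sketch: stop early
            break
        alpha = 0.5 * np.log(ep / em)            # line 5
        Z = (1.0 - ep - em) + 2.0 * np.sqrt(ep * em)   # line 6, with eps^0 = 1 - ep - em
        D = D * np.exp(-alpha * diff[k]) / Z     # lines 7-8
        chosen.append(k); alphas.append(alpha); Zs.append(Z)
    return np.array(chosen, dtype=int), np.array(alphas), np.array(Zs), D

# Hypothetical data: pairs of 2-d points, labeled by comparing coordinate sums.
rng = np.random.default_rng(0)
pts, pts_p = rng.normal(size=(40, 2)), rng.normal(size=(40, 2))
y = np.where(pts_p.sum(axis=1) > pts.sum(axis=1), 1, -1)

thresholds = np.linspace(-1.5, 1.5, 7)
def stumps(P):
    """Threshold rankers h(x) = 1[x_d > theta] on each coordinate d."""
    return np.array([(P[:, d] > th).astype(int)
                     for d in range(P.shape[1]) for th in thresholds])

H_x, H_xp = stumps(pts), stumps(pts_p)
chosen, alphas, Zs, D = rankboost(H_x, H_xp, y, T=20)

g_x, g_xp = alphas @ H_x[chosen], alphas @ H_xp[chosen]   # g = sum_t alpha_t h_t
emp_error = np.mean(y * (g_xp - g_x) <= 0)
```

By identity (9.12), the average of $e^{-y_i(g(x'_i) - g(x_i))}$ over the sample equals the product of the $Z_t$, so `emp_error` can never exceed `np.prod(Zs)`; checking this is a convenient sanity test for an implementation.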


9.4.1 Bound on the empirical error 


We first show that the empirical error of RankBoost decreases exponentially fast as a function of the number of rounds of boosting when the edge $\gamma_t$ of each base ranker $h_t$ is lower bounded by some positive value $\gamma > 0$.


Theorem 9.2
The empirical error of the hypothesis $h\colon \mathcal{X} \to \{0,1\}$ returned by RankBoost verifies:

$$\widehat{R}(h) \le \exp\Big[-2\sum_{t=1}^T\Big(\frac{\epsilon_t^+ - \epsilon_t^-}{2}\Big)^2\Big]. \tag{9.13}$$

Furthermore, if there exists $\gamma$ such that for all $t \in [1,T]$, $0 < \gamma \le \frac{\epsilon_t^+ - \epsilon_t^-}{2}$, then

$$\widehat{R}(h) \le \exp(-2\gamma^2 T). \tag{9.14}$$


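Proof Using the general inequality $1_{u \le 0} \le \exp(-u)$ valid for all $u \in \mathbb{R}$ and identity (9.12), we can write:

$$\widehat{R}(h) = \frac{1}{m}\sum_{i=1}^m 1_{y_i(g(x'_i) - g(x_i)) \le 0} \le \frac{1}{m}\sum_{i=1}^m e^{-y_i(g(x'_i) - g(x_i))} \le \frac{1}{m}\sum_{i=1}^m \Big[m\prod_{t=1}^T Z_t\Big] D_{T+1}(i) = \prod_{t=1}^T Z_t.$$

By the definition of the normalization factor, for all $t \in [1,T]$, we have $Z_t = \sum_{i=1}^m D_t(i)\, e^{-\alpha_t y_i(h_t(x'_i) - h_t(x_i))}$. By grouping together the indices $i$ for which $y_i(h_t(x'_i) - h_t(x_i))$ takes the value $+1$, $-1$, or $0$, $Z_t$ can be rewritten as

$$Z_t = \epsilon_t^+ e^{-\alpha_t} + \epsilon_t^- e^{\alpha_t} + \epsilon_t^0 = \epsilon_t^+\sqrt{\frac{\epsilon_t^-}{\epsilon_t^+}} + \epsilon_t^-\sqrt{\frac{\epsilon_t^+}{\epsilon_t^-}} + \epsilon_t^0 = 2\sqrt{\epsilon_t^+\epsilon_t^-} + \epsilon_t^0.$$

Since $\epsilon_t^+ = 1 - \epsilon_t^- - \epsilon_t^0$, we have

$$4\epsilon_t^+\epsilon_t^- = (\epsilon_t^+ + \epsilon_t^-)^2 - (\epsilon_t^+ - \epsilon_t^-)^2 = (1 - \epsilon_t^0)^2 - (\epsilon_t^+ - \epsilon_t^-)^2.$$

Thus, assuming that $\epsilon_t^0 < 1$, $Z_t$ can be upper bounded as follows:

$$\begin{aligned}
Z_t &= \sqrt{(1 - \epsilon_t^0)^2 - (\epsilon_t^+ - \epsilon_t^-)^2} + \epsilon_t^0 \\
&= (1 - \epsilon_t^0)\sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{(1 - \epsilon_t^0)^2}} + \epsilon_t^0 \\
&\le \sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{1 - \epsilon_t^0}} \\
&\le \exp\Big(-\frac{(\epsilon_t^+ - \epsilon_t^-)^2}{2(1 - \epsilon_t^0)}\Big) \le \exp\Big(-\frac{(\epsilon_t^+ - \epsilon_t^-)^2}{2}\Big) = \exp\big(-2[(\epsilon_t^+ - \epsilon_t^-)/2]^2\big),
\end{aligned}$$

where we used for the first inequality the concavity of the square-root function (the preceding line is a convex combination of $\sqrt{1 - (\epsilon_t^+ - \epsilon_t^-)^2/(1 - \epsilon_t^0)^2}$ and $\sqrt{1}$ with weights $1 - \epsilon_t^0$ and $\epsilon_t^0$), for the second inequality the inequality $1 - x \le e^{-x}$ valid for all $x \in \mathbb{R}$, and for the third the fact that $0 < 1 - \epsilon_t^0 \le 1$. This upper bound on $Z_t$ also trivially holds when $\epsilon_t^0 = 1$ since in that case $\epsilon_t^+ = \epsilon_t^- = 0$. This concludes the proof. ∎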


As can be seen from the proof of the theorem, the weak ranking assumption $\gamma \le \frac{\epsilon_t^+ - \epsilon_t^-}{2}$ with $\gamma > 0$ can be replaced with the somewhat weaker requirement $\gamma \le \frac{\epsilon_t^+ - \epsilon_t^-}{2\sqrt{1 - \epsilon_t^0}}$, with $\epsilon_t^0 \ne 1$, which can be rewritten as $\gamma \le \frac{1}{2}\frac{\epsilon_t^+ - \epsilon_t^-}{\sqrt{\epsilon_t^+ + \epsilon_t^-}}$, with $\epsilon_t^+ + \epsilon_t^- \ne 0$, where the quantity $\frac{\epsilon_t^+ - \epsilon_t^-}{\sqrt{\epsilon_t^+ + \epsilon_t^-}}$ can be interpreted as a (normalized) relative difference between $\epsilon_t^+$ and $\epsilon_t^-$.

The proof of the theorem also shows that the coefficient $\alpha_t$ is selected to minimize $Z_t$. Thus, overall, these coefficients are chosen to minimize the upper bound on the empirical error $\prod_{t=1}^T Z_t$, as for AdaBoost. The RankBoost algorithm can be generalized in several ways:

- instead of a hypothesis with minimal difference $\epsilon_t^- - \epsilon_t^+$, $h_t$ can be more generally a base ranker returned by a weak ranking algorithm trained on $D_t$ with $\epsilon_t^+ > \epsilon_t^-$;

- the range of the base rankers could be $[0,+1]$, or more generally $\mathbb{R}$. The coefficients $\alpha_t$ can then be different and may not even admit a closed form. However, in general, they are chosen to minimize the upper bound $\prod_{t=1}^T Z_t$ on the empirical error.
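
The claim that $\alpha_t$ minimizes $Z_t$ can be checked numerically. Viewing the normalization factor as a function of a free coefficient $\eta$, that is $Z_t(\eta) = \epsilon_t^+ e^{-\eta} + \epsilon_t^- e^{\eta} + \epsilon_t^0$, the short Python sketch below compares a grid search over $\eta$ with the closed form. The values chosen for $\epsilon_t^+$ and $\epsilon_t^-$ are hypothetical and serve only as an example.

```python
import numpy as np

def Z(eta, eps_plus, eps_minus):
    """Normalization factor as a function of the coefficient eta."""
    eps_zero = 1.0 - eps_plus - eps_minus
    return eps_plus * np.exp(-eta) + eps_minus * np.exp(eta) + eps_zero

eps_plus, eps_minus = 0.55, 0.25            # hypothetical values, so eps^0 = 0.2
alpha = 0.5 * np.log(eps_plus / eps_minus)  # coefficient of line 5 of RankBoost

grid = np.linspace(-3.0, 3.0, 6001)
z_grid_min = Z(grid, eps_plus, eps_minus).min()                    # brute-force minimum
z_closed = (1.0 - eps_plus - eps_minus) + 2.0 * np.sqrt(eps_plus * eps_minus)

gamma = (eps_plus - eps_minus) / 2.0        # edge of the base ranker
bound = np.exp(-2.0 * gamma ** 2)           # per-round bound used in theorem 9.2
```

Here `Z(alpha, ...)` coincides with `z_closed`, no grid point does better, and `z_closed` stays below `bound`, in line with the proof above.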


9.4.2 Relationship with coordinate descent 


RankBoost coincides with the application of the coordinate descent technique to a convex and differentiable objective function $F$ defined for all samples $S = ((x_1,x'_1,y_1), \ldots, (x_m,x'_m,y_m)) \in \mathcal{X} \times \mathcal{X} \times \{-1,0,+1\}$ and $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, $n \ge 1$, by

$$F(\alpha) = \sum_{i=1}^m e^{-y_i[g_n(x'_i) - g_n(x_i)]} = \sum_{i=1}^m e^{-y_i\sum_{t=1}^n \alpha_t[h_t(x'_i) - h_t(x_i)]}, \tag{9.15}$$

where $g_n = \sum_{t=1}^n \alpha_t h_t$. This loss function is a convex upper bound on the zero-one pairwise loss function $\alpha \mapsto \sum_{i=1}^m 1_{y_i[g_n(x'_i) - g_n(x_i)] \le 0}$, which is not convex. Let $e_t$ denote the unit vector corresponding to the $t$th coordinate in $\mathbb{R}^n$ and let $\alpha_{t-1}$ denote the vector based on the $(t-1)$ first coefficients, i.e. $\alpha_{t-1} = (\alpha_1, \ldots, \alpha_{t-1}, 0, \ldots, 0)^\top$ if $t - 1 > 0$, $\alpha_{t-1} = 0$ otherwise. At each iteration $t \ge 1$, the direction $e_t$ selected by coordinate descent is the one minimizing the directional derivative:

$$e_t = \operatorname*{argmin}_{t}\ \frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta}\bigg|_{\eta=0}.$$


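Since $F(\alpha_{t-1} + \eta e_t) = \sum_{i=1}^m e^{-y_i\sum_{s=1}^{t-1}\alpha_s(h_s(x'_i) - h_s(x_i)) - \eta y_i(h_t(x'_i) - h_t(x_i))}$, the directional derivative along $e_t$ can be expressed as follows:

$$\begin{aligned}
\frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta}\bigg|_{\eta=0}
&= -\sum_{i=1}^m y_i(h_t(x'_i) - h_t(x_i))\exp\Big[-y_i\sum_{s=1}^{t-1}\alpha_s(h_s(x'_i) - h_s(x_i))\Big] \\
&= -\sum_{i=1}^m y_i(h_t(x'_i) - h_t(x_i))\, D_t(i)\Big[m\prod_{s=1}^{t-1} Z_s\Big] \\
&= -\Big[\sum_{i=1}^m D_t(i)\, 1_{y_i(h_t(x'_i) - h_t(x_i)) = +1} - \sum_{i=1}^m D_t(i)\, 1_{y_i(h_t(x'_i) - h_t(x_i)) = -1}\Big]\Big[m\prod_{s=1}^{t-1} Z_s\Big] \\
&= -\big[\epsilon_t^+ - \epsilon_t^-\big]\Big[m\prod_{s=1}^{t-1} Z_s\Big].
\end{aligned}$$

The first equality holds by differentiation and evaluation at $\eta = 0$ and the second one follows from (9.12). In view of the final equality, since $m\prod_{s=1}^{t-1} Z_s$ is fixed, the direction $e_t$ selected by coordinate descent is the one minimizing $\epsilon_t^- - \epsilon_t^+$, which corresponds exactly to the base ranker $h_t$ selected by RankBoost.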

The step size $\eta$ is identified by setting the derivative to zero in order to minimize the function in the chosen direction $e_t$. Thus, using identity (9.12) and the definition of $\epsilon_t^s$, we can write:

$$\begin{aligned}
&\frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta} = 0 \\
\Leftrightarrow\ & -\sum_{i=1}^m y_i(h_t(x'_i) - h_t(x_i))\, e^{-y_i\sum_{s=1}^{t-1}\alpha_s(h_s(x'_i) - h_s(x_i))}\, e^{-\eta y_i(h_t(x'_i) - h_t(x_i))} = 0 \\
\Leftrightarrow\ & -\sum_{i=1}^m y_i(h_t(x'_i) - h_t(x_i))\, D_t(i)\Big[m\prod_{s=1}^{t-1} Z_s\Big] e^{-\eta y_i(h_t(x'_i) - h_t(x_i))} = 0 \\
\Leftrightarrow\ & -\sum_{i=1}^m y_i(h_t(x'_i) - h_t(x_i))\, D_t(i)\, e^{-\eta y_i(h_t(x'_i) - h_t(x_i))} = 0 \\
\Leftrightarrow\ & -\big[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}\big] = 0 \\
\Leftrightarrow\ & \eta = \frac{1}{2}\log\frac{\epsilon_t^+}{\epsilon_t^-}.
\end{aligned}$$

This proves that the step size chosen by coordinate descent matches the base ranker weight $\alpha_t$ of RankBoost. Thus, coordinate descent applied to $F$ precisely coincides with the RankBoost algorithm. As in the classification case, other convex loss functions upper bounding the zero-one pairwise misranking loss can be used. In particular, the following objective function based on the logistic loss can be used: $\alpha \mapsto \sum_{i=1}^m \log\big(1 + e^{-y_i[g_n(x'_i) - g_n(x_i)]}\big)$ to derive an alternative boosting-type algorithm.
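
Both steps of this argument, the value of the directional derivative and the closed-form step size, can be verified numerically on a toy problem. The sketch below is an illustration only: the matrix `diff`, whose entry $(k,i)$ stands for $y_i(h_k(x'_i) - h_k(x_i))$, the current coefficient vector and the chosen coordinate are all hypothetical values picked for the example.

```python
import numpy as np

# Hypothetical values of y_i (h_k(x'_i) - h_k(x_i)) for 3 base rankers and 5 pairs.
diff = np.array([[1,  1, -1,  0,  1],
                 [1, -1,  1,  1,  0],
                 [0,  1,  1, -1, -1]])
alpha = np.array([0.3, 0.0, 0.0])    # coefficients after one round (assumed)

def F(a):
    """Exponential objective (9.15) for the toy problem."""
    return np.exp(-(a @ diff)).sum()

w = np.exp(-(alpha @ diff))          # unnormalized weights; their sum is m * prod_s Z_s
D = w / w.sum()                      # current distribution, as in identity (9.12)

k = 1                                # coordinate (base ranker) under examination
e_k = np.eye(diff.shape[0])[k]
eps_plus = D[diff[k] == 1].sum()
eps_minus = D[diff[k] == -1].sum()

# Directional derivative at eta = 0: central finite difference vs closed form.
step = 1e-6
num_deriv = (F(alpha + step * e_k) - F(alpha - step * e_k)) / (2 * step)
closed_deriv = -(eps_plus - eps_minus) * w.sum()

# Step size: grid line search vs (1/2) log(eps^+ / eps^-).
etas = np.linspace(-2.0, 2.0, 4001)
eta_grid = etas[np.argmin([F(alpha + eta * e_k) for eta in etas])]
eta_star = 0.5 * np.log(eps_plus / eps_minus)
```

Up to the resolution of the grid, `eta_grid` agrees with `eta_star`, and the finite-difference slope agrees with `closed_deriv`.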


9.4.3. Margin bound for ensemble methods in ranking 


To simplify the presentation, we will assume for the results of this section, as in section 9.2, that the pairwise labels are in $\{-1,+1\}$. By theorem 6.2, the empirical Rademacher complexity of the convex hull $\operatorname{conv}(H)$ equals that of $H$. Thus, theorem 9.1 immediately implies the following guarantee for ensembles of hypotheses in ranking.


Corollary 9.2

Let $H$ be a set of real-valued functions. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample $S$ of size $m$, each of the following ranking guarantees holds for all $h \in \operatorname{conv}(H)$:

$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\left(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\right) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \tag{9.16}$$

$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\left(\widehat{\mathfrak{R}}_{S_1}(H) + \widehat{\mathfrak{R}}_{S_2}(H)\right) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{9.17}$$


For RankBoost, these bounds apply to $g/\|\alpha\|_1$, where $g$ is the hypothesis returned by the algorithm. Since $g$ and $g/\|\alpha\|_1$ induce the same ordering of the points, for any $\delta > 0$, the following holds with probability at least $1 - \delta$:

$$R(g) \le \widehat{R}_\rho(g/\|\alpha\|_1) + \frac{2}{\rho}\left(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\right) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{9.18}$$

Remarkably, the number of rounds of boosting $T$ does not appear in this bound. The bound depends only on the margin $\rho$, the sample size $m$, and the Rademacher complexity of the family of base rankers $H$. Thus, the bound guarantees an effective generalization if the pairwise margin loss $\widehat{R}_\rho(g/\|\alpha\|_1)$ is small for a relatively large $\rho$. A bound similar to that of theorem 6.3 for AdaBoost can be derived for the empirical pairwise ranking margin loss of RankBoost (see exercise 9.3) and similar comments on that result apply here.

These results provide a margin-based analysis in support of ensemble methods 
in ranking and RankBoost in particular. As in the case of AdaBoost, however, 
RankBoost in general does not achieve a maximum margin. But, in practice, it has 
been observed to obtain excellent pairwise ranking performances. 
9.5 Bipartite ranking


This section examines an important ranking scenario within the score-based setting, the bipartite ranking problem. In this scenario, the set of points $\mathcal{X}$ is partitioned into two classes: $\mathcal{X}_+$, the class of positive points, and $\mathcal{X}_-$, that of negative ones. The problem consists of ranking positive points higher than negative ones. For example, for a fixed search engine query, the task consists of ranking relevant (positive) documents higher than irrelevant (negative) ones.

The bipartite problem could be treated in the way already discussed in the 
previous sections with exactly the same theory and algorithms. However, the setup 
typically adopted for this problem is different: instead of assuming that the learner 
receives a sample of random pairs, here pairs of positive and negative elements, it 
is assumed that he receives a sample of positive points from some distribution and 
a sample of negative points from another. This leads to the set of all pairs made of 
a positive point of the first sample and a negative point of the second. 

More formally, the learner receives a sample $S_+ = (x'_1, \ldots, x'_m)$ drawn i.i.d. according to some distribution $D_+$ over $\mathcal{X}_+$, and a sample $S_- = (x_1, \ldots, x_n)$ drawn i.i.d. according to some distribution $D_-$ over $\mathcal{X}_-$.¹ Given a hypothesis set $H$ of functions mapping $\mathcal{X}$ to $\mathbb{R}$, the learning problem consists of selecting a hypothesis $h \in H$ with small expected bipartite misranking or generalization error $R(h)$:

$$R(h) = \Pr_{\substack{x \sim D_- \\ x' \sim D_+}}\big[h(x') < h(x)\big]. \tag{9.19}$$

The empirical pairwise misranking or empirical error of $h$ is denoted by $\widehat{R}(h)$ and defined by

$$\widehat{R}(h) = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n 1_{h(x'_i) < h(x_j)}. \tag{9.20}$$

Note that while the bipartite ranking problem bears some similarity with binary 
classification, in particular, the presence of two classes, they are distinct problems, 
since their objectives and measures of success clearly differ. 


1. This two-distribution formulation also avoids a potential dependency issue that can arise for some modeling of the problem: if pairs are drawn according to some distribution $D$ over $\mathcal{X}_- \times \mathcal{X}_+$ and the learner makes use of this information to augment his training sample, then the resulting sample is in general not i.i.d. This is because if $(x_1,x'_1)$ and $(x_2,x'_2)$ are in the sample, then so are the pairs $(x_1,x'_2)$ and $(x_2,x'_1)$, and thus the pairs are not independent. However, without sample augmentation, the points are i.i.d., and this issue does not arise.

Under the formulation of bipartite ranking just presented, the learning algorithm must typically deal with $mn$ pairs. For example, the application of SVMs to ranking in this scenario leads to an optimization with $mn$ slack variables or constraints. With just a thousand positive and a thousand negative points, one million pairs would need to be considered. This can lead to a prohibitive computational cost for some learning algorithms. The next section shows that RankBoost admits an efficient implementation in the bipartite scenario.


9.5.1 Boosting in bipartite ranking 


This section shows the efficiency of RankBoost in the bipartite scenario and discusses the connection between AdaBoost and RankBoost in this context.

The key property of RankBoost leading to an efficient algorithm in the bipartite setting is the fact that its objective function is based on the exponential function. As a result, it can be decomposed into the product of two functions, one depending on only the positive and the other on only the negative points. Similarly, the distribution $D_t$ maintained by the algorithm can be factored as the product of two distributions $D_t^+$ and $D_t^-$. This is clear for the uniform distribution $D_1$ at the first round, as for any $i \in [1,m]$ and $j \in [1,n]$, $D_1(i,j) = 1/(mn) = D_1^+(i)D_1^-(j)$ with $D_1^+(i) = 1/m$ and $D_1^-(j) = 1/n$. This property is recursively preserved since, in view of the following, the decomposition of $D_t$ implies that of $D_{t+1}$ for any $t \in [1,T]$. For any $i \in [1,m]$ and $j \in [1,n]$, by definition of the update, we can write:

$$D_{t+1}(i,j) = \frac{D_t(i,j)\, e^{-\alpha_t[h_t(x'_i) - h_t(x_j)]}}{Z_t} = \frac{D_t^+(i)\, e^{-\alpha_t h_t(x'_i)}}{Z_t^+} \cdot \frac{D_t^-(j)\, e^{\alpha_t h_t(x_j)}}{Z_t^-},$$

since the normalization factor $Z_t$ can also be decomposed as $Z_t = Z_t^- Z_t^+$, with $Z_t^+ = \sum_{i=1}^m D_t^+(i)\, e^{-\alpha_t h_t(x'_i)}$ and $Z_t^- = \sum_{j=1}^n D_t^-(j)\, e^{\alpha_t h_t(x_j)}$. Furthermore, the pairwise misranking of a hypothesis $h \in H$ based on the distribution $D_t$ used to determine $h_t$ can also be computed as the difference of two quantities, one depending only on positive points, the other only on negative ones:

$$\mathop{\mathbb{E}}_{(i,j) \sim D_t}\big[h(x'_i) - h(x_j)\big] = \mathop{\mathbb{E}}_{i \sim D_t^+}\Big[\mathop{\mathbb{E}}_{j \sim D_t^-}\big[h(x'_i) - h(x_j)\big]\Big] = \mathop{\mathbb{E}}_{i \sim D_t^+}\big[h(x'_i)\big] - \mathop{\mathbb{E}}_{j \sim D_t^-}\big[h(x_j)\big].$$

Thus, the time and space complexity of RankBoost depends only on the total number of points $m + n$ and not the number of pairs $mn$. More specifically, ignoring the call to the weak ranker or the cost of determining $h_t$, the time and space complexity of each round is linear, that is, in $O(m + n)$. Furthermore, the cost of determining $h_t$ depends only on $m + n$ and not on $mn$. Figure 9.2 gives the pseudocode of the algorithm adapted to the bipartite scenario.
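
The factorization of the update can be confirmed on a small example by comparing the update carried out over all $mn$ pairs with the two separate updates over the $m$ positive and $n$ negative points. The following Python sketch uses made-up values for the base ranker, the distributions and the coefficient; only the identities being tested come from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
h_pos = rng.integers(0, 2, size=m)       # hypothetical values h_t(x'_i) in {0, 1}
h_neg = rng.integers(0, 2, size=n)       # hypothetical values h_t(x_j) in {0, 1}
D_pos = rng.random(m); D_pos /= D_pos.sum()    # D_t^+
D_neg = rng.random(n); D_neg /= D_neg.sum()    # D_t^-
alpha = 0.7                              # any coefficient works for this identity

# Update carried out over all mn pairs.
D_pair = np.outer(D_pos, D_neg)                           # D_t(i, j) = D_t^+(i) D_t^-(j)
D_pair_next = D_pair * np.exp(-alpha * (h_pos[:, None] - h_neg[None, :]))
Z = D_pair_next.sum()
D_pair_next /= Z

# Factored update over m + n points.
D_pos_next = D_pos * np.exp(-alpha * h_pos)
Z_pos = D_pos_next.sum()
D_pos_next /= Z_pos
D_neg_next = D_neg * np.exp(alpha * h_neg)
Z_neg = D_neg_next.sum()
D_neg_next /= Z_neg

# Expected score difference computed over pairs and over points.
pair_expectation = (D_pair * (h_pos[:, None] - h_neg[None, :])).sum()
point_expectation = D_pos @ h_pos - D_neg @ h_neg
```

The pairwise array `D_pair_next` is never needed in practice: it equals the outer product of `D_pos_next` and `D_neg_next`, which is what makes the per-round cost linear in $m + n$.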

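BIPARTITERANKBOOST($S = (x'_1, \ldots, x'_m, x_1, \ldots, x_n)$)
 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1^+(i) \leftarrow \frac{1}{m}$
 3  for $j \leftarrow 1$ to $n$ do
 4      $D_1^-(j) \leftarrow \frac{1}{n}$
 5  for $t \leftarrow 1$ to $T$ do
 6      $h_t \leftarrow$ base ranker in $H$ with smallest $\epsilon_t^- - \epsilon_t^+ = \mathbb{E}_{j \sim D_t^-}[h(x_j)] - \mathbb{E}_{i \sim D_t^+}[h(x'_i)]$
 7      $\alpha_t \leftarrow \frac{1}{2}\log\frac{\epsilon_t^+}{\epsilon_t^-}$
 8      $Z_t^+ \leftarrow 1 - \epsilon_t^+ + \sqrt{\epsilon_t^+\epsilon_t^-}$
 9      for $i \leftarrow 1$ to $m$ do
10          $D_{t+1}^+(i) \leftarrow \frac{D_t^+(i)\exp[-\alpha_t h_t(x'_i)]}{Z_t^+}$
11      $Z_t^- \leftarrow 1 - \epsilon_t^- + \sqrt{\epsilon_t^+\epsilon_t^-}$
12      for $j \leftarrow 1$ to $n$ do
13          $D_{t+1}^-(j) \leftarrow \frac{D_t^-(j)\exp[+\alpha_t h_t(x_j)]}{Z_t^-}$
14  $g \leftarrow \sum_{t=1}^T \alpha_t h_t$
15  return $g$

Figure 9.2 Pseudocode of RankBoost in a bipartite setting, with $H \subseteq \{0,1\}^{\mathcal{X}}$, $\epsilon_t^+ = \mathbb{E}_{i \sim D_t^+}[h(x'_i)]$ and $\epsilon_t^- = \mathbb{E}_{j \sim D_t^-}[h(x_j)]$.

In the bipartite scenario, a connection can be made between the classification algorithm AdaBoost and the ranking algorithm RankBoost. In particular, the objective function of RankBoost can be expressed as follows for any $\alpha = (\alpha_1, \ldots, \alpha_T) \in \mathbb{R}^T$, $T \ge 1$:

$$\begin{aligned}
F_{\text{RankBoost}}(\alpha) &= \sum_{j=1}^n\sum_{i=1}^m \exp\big(-[g(x'_i) - g(x_j)]\big) \\
&= \Big(\sum_{i=1}^m e^{-\sum_{t=1}^T \alpha_t h_t(x'_i)}\Big)\Big(\sum_{j=1}^n e^{+\sum_{t=1}^T \alpha_t h_t(x_j)}\Big) \\
&= F_+(\alpha)F_-(\alpha),
\end{aligned}$$

where $F_+$ denotes the function defined by the sum over the positive points and $F_-$ the function defined over the negative points. The objective function of AdaBoost can be defined in terms of these same two functions as follows:

$$\begin{aligned}
F_{\text{AdaBoost}}(\alpha) &= \sum_{i=1}^m \exp\big(-y'_i g(x'_i)\big) + \sum_{j=1}^n \exp\big(-y_j g(x_j)\big) \\
&= \sum_{i=1}^m e^{-\sum_{t=1}^T \alpha_t h_t(x'_i)} + \sum_{j=1}^n e^{+\sum_{t=1}^T \alpha_t h_t(x_j)} \\
&= F_+(\alpha) + F_-(\alpha).
\end{aligned}$$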


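Note that the gradient of the objective function of RankBoost can be expressed in terms of AdaBoost as follows:

$$\begin{aligned}
\nabla_\alpha F_{\text{RankBoost}}(\alpha) &= F_-(\alpha)\nabla_\alpha F_+(\alpha) + F_+(\alpha)\nabla_\alpha F_-(\alpha) \qquad (9.21) \\
&= F_-(\alpha)\big(\nabla_\alpha F_+(\alpha) + \nabla_\alpha F_-(\alpha)\big) + \big(F_+(\alpha) - F_-(\alpha)\big)\nabla_\alpha F_-(\alpha) \\
&= F_-(\alpha)\nabla_\alpha F_{\text{AdaBoost}}(\alpha) + \big(F_+(\alpha) - F_-(\alpha)\big)\nabla_\alpha F_-(\alpha).
\end{aligned}$$

If $\alpha$ is a minimizer of $F_{\text{AdaBoost}}$, then $\nabla_\alpha F_{\text{AdaBoost}}(\alpha) = 0$ and it can be shown that the equality $F_+(\alpha) - F_-(\alpha) = 0$ also holds for $\alpha$, provided that the family of base hypotheses $H$ used for AdaBoost includes the constant hypothesis $h_0\colon x \mapsto 1$, which often is the case in practice. Then, by (9.21), this implies that $\nabla_\alpha F_{\text{RankBoost}}(\alpha) = 0$ and therefore that $\alpha$ is also a minimizer of the convex function $F_{\text{RankBoost}}$. In general, $F_{\text{AdaBoost}}$ does not admit a minimizer. Nevertheless, it can be shown that if $\lim_{k \to \infty} F_{\text{AdaBoost}}(\alpha_k) = \inf_\alpha F_{\text{AdaBoost}}(\alpha)$ for some sequence $(\alpha_k)_{k \in \mathbb{N}}$, then, under the same assumption on the use of a constant base hypothesis and for a non-linearly separable dataset, the following holds: $\lim_{k \to \infty} F_{\text{RankBoost}}(\alpha_k) = \inf_\alpha F_{\text{RankBoost}}(\alpha)$.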
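
The product form of the RankBoost objective and the gradient identity (9.21) are easy to confirm numerically. In the sketch below, the base ranker values `Hp` (on positive points) and `Hn` (on negative points) and the coefficient vector are randomly generated placeholders; the helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
T, m, n = 3, 5, 4
Hp = rng.integers(0, 2, size=(T, m)).astype(float)   # Hp[t, i] = h_t(x'_i), hypothetical
Hn = rng.integers(0, 2, size=(T, n)).astype(float)   # Hn[t, j] = h_t(x_j), hypothetical
alpha = rng.normal(size=T)

def F_plus(a):
    return np.exp(-(a @ Hp)).sum()

def F_minus(a):
    return np.exp(a @ Hn).sum()

def grad_F_plus(a):
    return -(Hp @ np.exp(-(a @ Hp)))

def grad_F_minus(a):
    return Hn @ np.exp(a @ Hn)

# RankBoost objective as a double sum over all mn pairs.
g_pos, g_neg = alpha @ Hp, alpha @ Hn
F_rank_pairs = np.exp(-(g_pos[:, None] - g_neg[None, :])).sum()

# Left- and right-hand sides of (9.21).
grad_rank = F_minus(alpha) * grad_F_plus(alpha) + F_plus(alpha) * grad_F_minus(alpha)
grad_ada = grad_F_plus(alpha) + grad_F_minus(alpha)
rhs = F_minus(alpha) * grad_ada + (F_plus(alpha) - F_minus(alpha)) * grad_F_minus(alpha)
```

The identity holds for every $\alpha$; it becomes informative at points where $\nabla_\alpha F_{\text{AdaBoost}}$ vanishes and $F_+ = F_-$, as discussed above.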

The connections between AdaBoost and RankBoost just mentioned suggest that 
AdaBoost could achieve a good ranking performance as well. This is often observed 
empirically, a fact that brings strong support to the use of AdaBoost both as a 
classifier and a ranking algorithm. Nevertheless, RankBoost may converge faster 
and achieve a good ranking faster than AdaBoost. 


9.5.2 Area under the ROC curve 


The performance of a bipartite ranking algorithm is typically reported in terms of 
the area under the receiver operating characteristic (ROC) curve, or the area under 
the curve (AUC) for short. 


Let $U$ be a test sample used to evaluate the performance of $h$ (or a training sample) with $m$ positive points $z'_1, \ldots, z'_m$ and $n$ negative points $z_1, \ldots, z_n$. For any $h \in H$, let $\widehat{R}(h,U)$ denote the average pairwise misranking of $h$ over $U$. Then, the AUC of $h$ for the sample $U$ is precisely $1 - \widehat{R}(h,U)$, that is, its average pairwise ranking accuracy on $U$:

$$\mathrm{AUC}(h,U) = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n 1_{h(z'_i) \ge h(z_j)} = \Pr_{\substack{z \sim \widehat{D}_U^- \\ z' \sim \widehat{D}_U^+}}\big[h(z') \ge h(z)\big].$$

Figure 9.3 The AUC (area under the ROC curve) is a measure of the performance of a bipartite ranking. (Plot axes: false positive rate, horizontal; true positive rate, vertical.)

Here, $\widehat{D}_U^+$ denotes the empirical distribution corresponding to the positive points in $U$ and $\widehat{D}_U^-$ the empirical distribution corresponding to the negative ones. $\mathrm{AUC}(h,U)$ is thus an empirical estimate of the pairwise ranking accuracy based on the sample $U$, and by definition it is in $[0,1]$. Higher AUC values correspond to a better ranking performance. In particular, an AUC of one indicates that the points of $U$ are ranked perfectly using $h$. $\mathrm{AUC}(h,U)$ can be computed in linear time from a sorted array containing the $m + n$ elements $h(z'_i)$ and $h(z_j)$, for $i \in [1,m]$ and $j \in [1,n]$. Assuming that the array is sorted in increasing order (with a positive point placed higher than a negative one if they both have the same score), the total number of correctly ranked pairs $r$ can be computed as follows. Starting with $r = 0$, the array is inspected in increasing order of the indices while maintaining at any time the number of negative points seen so far, and incrementing the current value of $r$ by that number whenever a positive point is found. After full inspection of the array, the AUC is given by $r/(mn)$. Thus, assuming that a comparison-based sorting algorithm is used, the complexity of the computation of the AUC is in $O((m + n)\log(m + n))$.
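
The counting procedure just described can be written directly in Python. The sketch below is illustrative: the function names and the small score lists are our own, and a quadratic pair-by-pair count is included only as a cross-check.

```python
def auc_sorted(pos_scores, neg_scores):
    """AUC via one sort and one linear scan, counting ties as correctly ranked."""
    m, n = len(pos_scores), len(neg_scores)
    # Tag negatives with 0 and positives with 1: sorting the (score, tag) tuples in
    # increasing order places a positive point above a negative one with equal score.
    tagged = sorted([(s, 0) for s in neg_scores] + [(s, 1) for s in pos_scores])
    r = 0                  # number of correctly ranked pairs found so far
    negatives_seen = 0     # number of negative points encountered so far
    for _, is_positive in tagged:
        if is_positive:
            r += negatives_seen
        else:
            negatives_seen += 1
    return r / (m * n)

def auc_pairs(pos_scores, neg_scores):
    """Direct O(mn) evaluation of the definition, for comparison."""
    correct = sum(1 for p in pos_scores for q in neg_scores if p >= q)
    return correct / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.4, 0.7]      # hypothetical scores h(z'_i) of positive points
neg = [0.1, 0.4, 0.8]      # hypothetical scores h(z_j) of negative points
auc_fast = auc_sorted(pos, neg)
auc_slow = auc_pairs(pos, neg)
```

On this example seven of the nine pairs are ranked correctly, including the tied pair with score 0.4, so both functions return 7/9.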

As indicated by its name, the AUC coincides with the area under the ROC curve 
(figure 9.3). An ROC curve plots the true positive rate, that is, the percentage of 
positive points correctly predicted as positive as a function of the false positive 
rate, that is, the percentage of negative points incorrectly predicted as positive. 
Figure 9.4 illustrates the definition and construction of an ROC curve. 
Figure 9.4 An example ROC curve and illustrated threshold. Varying the value of $\theta$ from one extreme to the other generates points on the curve.


Points are generated along the curve by varying a threshold value $\theta$ as in the right panel of figure 9.4, from higher values to lower ones. The threshold is used to determine the label of any point $x$ (positive or negative) based on $\operatorname{sgn}(h(x) - \theta)$. At one extreme, all points are predicted as negative; thus, the false positive rate is zero, but the true positive rate is zero as well. This gives the first point $(0,0)$ of the plot. At the other extreme, all points are predicted as positive; thus, both the true and the false positive rates are equal to one, which gives the point $(1,1)$. In the ideal case, as already discussed, the AUC value is one, and, with the exception of $(0,0)$, the curve coincides with a horizontal line reaching $(1,1)$.
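
The construction of the curve by threshold sweeping can be sketched as follows in Python. The function names and scores are hypothetical, and the example deliberately avoids ties between positive and negative scores: with ties, the trapezoidal area would count a tied pair as one half, whereas the AUC definition given earlier counts it as correctly ranked.

```python
def roc_points(pos_scores, neg_scores):
    """ROC points obtained by lowering the threshold theta through the observed scores.

    A point is predicted positive when its score is at least theta, which corresponds
    to placing theta just below each observed score.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]          # theta above every score: all predicted negative
    for theta in thresholds:
        tpr = sum(s >= theta for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= theta for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def area_under(points):
    """Trapezoidal area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

pos = [0.9, 0.7, 0.4]              # hypothetical scores of positive points
neg = [0.8, 0.3, 0.1]              # hypothetical scores of negative points
curve = roc_points(pos, neg)
area = area_under(curve)
pairwise_accuracy = sum(p >= q for p in pos for q in neg) / (len(pos) * len(neg))
```

The curve starts at $(0,0)$, ends at $(1,1)$, and the area under it matches the pairwise ranking accuracy of 7/9 for these scores.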


9.6 Preference-based setting 


This section examines a different setting for the problem of learning to rank: the preference-based setting. In this setting, the objective is to rank as accurately as possible any test subset $X \subseteq \mathcal{X}$, typically a finite set that we refer to as a finite query subset. This is close to the query-based scenario of search engines or information extraction systems, and the terminology stems from the fact that $X$ could be a set of items that need to be ranked in response to a particular query. The advantage of this setting over the score-based setting is that here the learning algorithm is not required to return a linear ordering of all points of $\mathcal{X}$, which may be impossible to achieve faultlessly in accordance with a general, possibly non-transitive, pairwise preference labeling. Supplying a correct linear ordering for a query subset is more likely to be achievable exactly or at least with a better approximation.

The preference-based setting consists of two stages. In the first stage, a sample of labeled pairs $S$, exactly as in the score-based setting, is used to learn a preference function $h\colon \mathcal{X} \times \mathcal{X} \to [0,1]$, that is, a function that assigns a higher value to a pair $(u,v)$ when $u$ is preferred to $v$ or is to be ranked higher than $v$, and smaller values in the opposite case. This preference function can be obtained as the output of a standard classification algorithm trained on $S$. A crucial difference with the score-based setting is that, in general, the preference function $h$ is not required to induce a linear ordering. The relation it induces may be non-transitive; thus, we may have, for example, $h(u,v) = h(v,w) = h(w,u) = 1$ for three distinct points $u$, $v$, and $w$.

In the second stage, given a query subset $X \subseteq \mathcal{X}$, the preference function $h$ is used to determine a ranking of $X$. How can $h$ be used to generate an accurate ranking? This will be the main focus of this section. The computational complexity of the algorithm determining the ranking is also crucial. Here, we will measure its running time complexity in terms of the number of calls to $h$.

When the preference function is obtained as the output of a binary classification 
algorithm, the preference-based setting can be viewed as a reduction of ranking to 
classification: the second stage specifies how a ranking is obtained from a classifier’s 
output. 


9.6.1 Second-stage ranking problem 


The ranking problem of the second stage is modeled as follows. We assume that a preference function $h$ is given. From the point of view of this stage, the way the function $h$ has been determined is immaterial; it can be viewed as a black box. As already discussed, $h$ is not assumed to be transitive. But we will assume that it is pairwise consistent, that is, $h(u,v) + h(v,u) = 1$ for all $u,v \in \mathcal{X}$.

Let $D$ be an unknown distribution according to which pairs $(X,\sigma^*)$ are drawn, where $X \subseteq \mathcal{X}$ is a query subset and $\sigma^*$ a target ranking or permutation of $X$, that is, a bijective function from $X$ to $\{1,\ldots,|X|\}$. Thus, we consider a stochastic scenario, and $\sigma^*$ is a random variable. The objective of a second-stage algorithm $A$ consists of using the preference function $h$ to return an accurate ranking $A(X)$ for any query subset $X$. The algorithm may be deterministic, in which case $A(X)$ is uniquely determined from $X$, or it may be randomized, in which case we denote by $s$ the randomization seed it may depend on.

The following loss function $L$ can be used to measure the disagreement between a ranking $\sigma$ and a desired one $\sigma^*$ over a set $X$ of $n > 1$ elements:

$$L(\sigma,\sigma^*) = \frac{2}{n(n-1)}\sum_{u \ne v} 1_{\sigma(u) < \sigma(v)}\, 1_{\sigma^*(v) < \sigma^*(u)}, \tag{9.22}$$

where the sum runs over all pairs $(u,v)$ with $u$ and $v$ distinct elements of $X$. All the results presented in the following hold for a broader set of loss functions described later. Abusing the notation, we also define the loss of the preference function $h$ with respect to a ranking $\sigma^*$ of a set $X$ of $n > 1$ elements by

$$L(h,\sigma^*) = \frac{2}{n(n-1)}\sum_{u \ne v} h(u,v)\, 1_{\sigma^*(v) < \sigma^*(u)}. \tag{9.23}$$

The expected loss for a deterministic algorithm $A$ is thus $\mathbb{E}_{(X,\sigma^*) \sim D}[L(A(X),\sigma^*)]$. The regret of algorithm $A$ is then defined as the difference between its loss and that of the best fixed global ranking. This can be written as follows:

$$\mathrm{Reg}(A) = \mathop{\mathbb{E}}_{(X,\sigma^*) \sim D}\big[L(A(X),\sigma^*)\big] - \min_{\sigma'}\mathop{\mathbb{E}}_{(X,\sigma^*) \sim D}\big[L(\sigma'_{|X},\sigma^*)\big], \tag{9.24}$$

where $\sigma'_{|X}$ denotes the ranking induced on $X$ by a global ranking $\sigma'$ of $\mathcal{X}$. Similarly, we define the regret of the preference function as follows:

$$\mathrm{Reg}(h) = \mathop{\mathbb{E}}_{(X,\sigma^*) \sim D}\big[L(h_{|X},\sigma^*)\big] - \min_{h'}\mathop{\mathbb{E}}_{(X,\sigma^*) \sim D}\big[L(h'_{|X},\sigma^*)\big], \tag{9.25}$$

where $h_{|X}$ denotes the restriction of $h$ to $X \times X$, and similarly with $h'$. The regret results presented in this section hold assuming the following pairwise independence on irrelevant alternatives property:

$$\mathop{\mathbb{E}}_{\sigma^*|X_1}\big[1_{\sigma^*(v) < \sigma^*(u)}\big] = \mathop{\mathbb{E}}_{\sigma^*|X_2}\big[1_{\sigma^*(v) < \sigma^*(u)}\big], \tag{9.26}$$

for any $u,v \in \mathcal{X}$ and any two sets $X_1$ and $X_2$ containing $u$ and $v$.² Similar regret definitions can be given for a randomized algorithm, additionally taking the expectation over $s$.

Clearly, the quality of the ranking output by the second-stage algorithm intimately depends on that of the preference function $h$. In the next sections, we discuss both a deterministic and a randomized second-stage algorithm for which the regret can be upper bounded in terms of the regret of the preference function.
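
To make the two loss functions concrete, here is a small Python sketch of (9.22) and (9.23). Rankings are represented as dictionaries mapping each item to its position, which is a representation chosen here for convenience; the item names, the example rankings and the helper `preference_from` are likewise illustrative assumptions.

```python
from itertools import permutations

def ranking_loss(sigma, sigma_star):
    """L(sigma, sigma*) of (9.22): normalized count of ordered pairs (u, v) with
    sigma(u) < sigma(v) while sigma*(v) < sigma*(u)."""
    items = list(sigma_star)
    n = len(items)
    total = sum(1 for u, v in permutations(items, 2)
                if sigma[u] < sigma[v] and sigma_star[v] < sigma_star[u])
    return 2.0 * total / (n * (n - 1))

def preference_loss(h, sigma_star):
    """L(h, sigma*) of (9.23) for a preference function h(u, v) in [0, 1]."""
    items = list(sigma_star)
    n = len(items)
    total = sum(h(u, v) for u, v in permutations(items, 2)
                if sigma_star[v] < sigma_star[u])
    return 2.0 * total / (n * (n - 1))

def preference_from(sigma):
    """Pairwise consistent preference function induced by a ranking sigma."""
    return lambda u, v: 1.0 if sigma[u] < sigma[v] else 0.0

sigma_star = {"a": 1, "b": 2, "c": 3}     # target ranking
swapped = {"a": 2, "b": 1, "c": 3}        # one adjacent pair exchanged
reverse = {"a": 3, "b": 2, "c": 1}        # complete reversal

loss_swapped = ranking_loss(swapped, sigma_star)
loss_reverse = ranking_loss(reverse, sigma_star)
loss_h = preference_loss(preference_from(swapped), sigma_star)
```

With three items, a single exchanged pair gives a loss of 1/3 and a complete reversal gives 1. When $h$ is the indicator induced by a ranking, (9.23) reduces to (9.22), which is why the same symbol $L$ is used for both.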


2. More generally, they hold without that assumption using the following weaker notions
of regret:


$$\mathrm{Reg}'(A) = E_{(X,\sigma^*)\sim D}[L(A(X), \sigma^*)] - E_X\Big[\min_{\sigma'} E_{\sigma^*|X}[L(\sigma', \sigma^*)]\Big] \tag{9.27}$$
$$\mathrm{Reg}'(h) = E_{(X,\sigma^*)\sim D}[L(h_{|X}, \sigma^*)] - E_X\Big[\min_{h'} E_{\sigma^*|X}[L(h', \sigma^*)]\Big] \tag{9.28}$$


where $\sigma'$ denotes a ranking of $X$, $h'$ a preference function defined over $X \times X$, and $\sigma^*|X$
the random variable $\sigma^*$ conditioned on $X$.


9.6.2 Deterministic algorithm


A natural deterministic algorithm for the second stage is based on the sort-by-degree
algorithm. This consists of ranking each element of $X$ based on the number of other
elements it is preferred to according to the preference function $h$. Let $A_{\text{sort-by-degree}}$
denote this algorithm. In the bipartite setting, the following bounds can be proven
for the expected loss of this algorithm and its regret:


$$E_{X,\sigma^*}\big[L(A_{\text{sort-by-degree}}(X), \sigma^*)\big] \le 2\, E_{X,\sigma^*}\big[L(h, \sigma^*)\big] \tag{9.29}$$
$$\mathrm{Reg}(A_{\text{sort-by-degree}}(X)) \le 2\, \mathrm{Reg}(h). \tag{9.30}$$


These results show that the sort-by-degree algorithm can achieve an accurate 
ranking when the loss or the regret of the preference function h is small. They 
also bound the ranking loss or regret of the algorithm in terms of the classification 
loss or regret of h, which can be viewed as a guarantee for the reduction of ranking 
to classification using the sort-by-degree algorithm. 
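
The sort-by-degree idea can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the convention that `h(u, v)` close to 1 means `u` is preferred to `v`, and the name `sort_by_degree` is hypothetical.

```python
def sort_by_degree(items, h):
    """Rank items by the number of other elements each one is preferred to.

    Assumed convention: h(u, v) close to 1 means u is preferred to v.
    The preference function is called once per ordered pair of distinct
    elements, so the cost is quadratic in len(items).
    """
    degree = {u: sum(h(u, v) for v in items if v != u) for u in items}
    # Highest degree first; Python's sort is stable, so ties keep input order.
    return sorted(items, key=lambda u: -degree[u])
```

When the preference function is induced by scores, the output is simply the items sorted by decreasing score. When it contains a cycle, all degrees in the cycle are equal and the order is decided by tie-breaking alone.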

Nevertheless, in some cases, the guarantee given by these results is weak or
uninformative owing to the presence of the factor of two. Consider the case of a binary
classifier $h$ with an error rate of just 25 percent, which is quite reasonable in many
applications. Assume that the Bayes error is close to zero for the classification
problem and, similarly, that for the ranking problem the regret and loss approximately
coincide. Then, the bound in (9.29) guarantees a worst-case pairwise
misranking error of at most 50 percent for the ranking algorithm, which is the pairwise
misranking error of random ranking.

Furthermore, the running time complexity of the algorithm is quadratic, that is, in
$\Omega(|X|^2)$ for a query set $X$, since it requires calling the preference function for every
pair $(u, v)$ with $u$ and $v$ in $X$.

As shown by the following theorem, no deterministic algorithm can improve upon 
the factor of two appearing in the regret guarantee of the sort-by-degree algorithm. 


Theorem 9.3 Lower bound for deterministic algorithms
For any deterministic algorithm $A$, there is a bipartite distribution for which


$$\mathrm{Reg}(A) \ge 2\, \mathrm{Reg}(h). \tag{9.31}$$


Proof Consider the simple case where $\mathcal{X} = X = \{u, v, w\}$ and where the
preference function induces a cycle as illustrated by figure 9.5a. An arrow from
$u$ to $v$ indicates that $v$ is preferred to $u$ according to $h$. The proof is based on an
adversarial choice of the target $\sigma^*$.

Without loss of generality, either $A$ returns the ranking $u, v, w$ (figure 9.5b) or
$w, v, u$ (figure 9.5c). In the first case, let $\sigma^*$ be defined by the labeling indicated in
the figure. In that case, we have $L(h, \sigma^*) = 1/3$, since $u$ is preferred to $w$ according
to $h$ while $w$ is labeled positively and $u$ negatively. The loss of the algorithm is
$L(A, \sigma^*) = 2/3$, since both $u$ and $v$ are ranked higher than the positively labeled $w$
by the algorithm. Similarly, $\sigma^*$ can be defined as in figure 9.5c in the second case,
and we find again that $L(h, \sigma^*) = 1/3$ and $L(A, \sigma^*) = 2/3$. This concludes the
proof. ∎


Figure 9.5 Illustration of the proof of theorem 9.3. Panels: (a), (b), (c).


The theorem suggests that randomization is necessary in order to achieve a better 
guarantee. In the next section, we present a randomized algorithm that benefits 
both from better guarantees and a better time complexity. 


9.6.3. Randomized algorithm 


The general idea of the algorithm described in this section is to use a straightforward
extension of the randomized QuickSort algorithm in the second stage. Unlike in
the standard version of QuickSort, here the comparison function is based on the
preference function, which in general is not transitive. Nevertheless, it can be shown
here, too, that the expected time complexity of the algorithm is in $O(n \log n)$ when
applied to an array of size $n$.

The algorithm works as follows, as illustrated by figure 9.6. At each recursive
step, a pivot element $u$ is selected uniformly at random from $X$. For each $v \neq u$, $v$
is placed on the left of $u$ with probability $h(v, u)$ and to its right with the remaining
probability $h(u, v)$. The algorithm proceeds recursively with the array to the left of
$u$ and the one to its right and returns the concatenation of the permutation returned
by the left recursion, $u$, and the permutation returned by the right recursion.
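
The recursive procedure just described can be sketched as below. It is an illustrative sketch under assumptions: the preference function is a callable `h(v, u)` in [0, 1] with `h(v, u) + h(u, v) = 1`, the left side of the output is read as the top of the ranking, and the name `preference_quicksort` and the optional `rng` argument are hypothetical. The top-$k$ variant is not covered.

```python
import random


def preference_quicksort(items, h, rng=None):
    """Randomized QuickSort driven by a (possibly non-transitive)
    preference function h with values in [0, 1]."""
    rng = rng or random.Random()
    if len(items) <= 1:
        return list(items)
    # Pivot chosen uniformly at random.
    pivot_index = rng.randrange(len(items))
    pivot = items[pivot_index]
    left, right = [], []
    for index, v in enumerate(items):
        if index == pivot_index:
            continue
        # v goes to the left of the pivot with probability h(v, pivot),
        # and to its right with the remaining probability.
        if rng.random() < h(v, pivot):
            left.append(v)
        else:
            right.append(v)
    return (
        preference_quicksort(left, h, rng)
        + [pivot]
        + preference_quicksort(right, h, rng)
    )
```

When `h` takes values in {0, 1} and is induced by scores, every placement is deterministic and the output is the items in decreasing score order, whatever pivots are drawn.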

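Let $A_{\text{QuickSort}}$ denote this algorithm. In the bipartite setting, the following
guarantees can be proven:

$$E_{X,\sigma^*,s}\big[L(A_{\text{QuickSort}}(X, s), \sigma^*)\big] = E_{X,\sigma^*}\big[L(h, \sigma^*)\big] \tag{9.32}$$
$$\mathrm{Reg}(A_{\text{QuickSort}}) \le \mathrm{Reg}(h). \tag{9.33}$$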


Thus, here, the factor of two of the bounds in the deterministic case has vanished,
which is substantially more favorable. Furthermore, the guarantee for the loss is
an equality. Moreover, the expected time complexity of the algorithm is only in
$O(n \log n)$, and, if only the top $k$ items need to be ranked, as in many
applications, the time complexity is reduced to $O(n + k \log k)$.


Figure 9.6 Illustration of randomized QuickSort based on a preference function $h$
(not necessarily transitive).

For the QuickSort algorithm, the following guarantee can also be proven in the
general ranking setting (not necessarily bipartite):

$$E_{X,\sigma^*,s}\big[L(A_{\text{QuickSort}}(X, s), \sigma^*)\big] \le 2\, E_{X,\sigma^*}\big[L(h, \sigma^*)\big]. \tag{9.34}$$


9.6.4 Extension to other loss functions


All of the results just presented hold for a broader class of loss functions $L_\omega$, defined
in terms of a weight function or emphasis function $\omega$. $L_\omega$ is similar to (9.22), but
measures the weighted disagreement between a ranking $\sigma$ and a desired one $\sigma^*$ over
a set $X$ of $n > 1$ elements as follows:


$$L_\omega(\sigma, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} \omega(\sigma^*(v), \sigma^*(u)) \, 1_{\sigma(u) < \sigma(v)} \, 1_{\sigma^*(v) < \sigma^*(u)}, \tag{9.35}$$


where the sum runs over all pairs $(u, v)$ with $u$ and $v$ distinct elements of $X$, and
where $\omega$ is a symmetric function whose properties are described below. Thus, the loss
counts the number of pairwise misrankings of $\sigma$ with respect to $\sigma^*$, each weighted
by $\omega$. Function $\omega$ is assumed to satisfy the following three natural axioms:

■ symmetry: $\omega(i, j) = \omega(j, i)$ for all $i, j$;

■ monotonicity: $\omega(i, j) \le \omega(i, k)$ if either $i < j < k$ or $i > j > k$;

■ triangle inequality: $\omega(i, j) \le \omega(i, k) + \omega(k, j)$.

The motivation for this last property stems from the following: if correctly ordering
items in positions $(i, k)$ and $(k, j)$ is not of great importance, then the same should
hold for items in positions $(i, j)$.

Using different functions $\omega$, the family of functions $L_\omega$ can cover several familiar
and important losses. Here are some examples. Setting $\omega(i, j) = 1$ for all $i \neq j$
yields the unweighted pairwise misranking measure. For a fixed integer $k \ge 1$,
the function $\omega$ defined by $\omega(i, j) = 1_{((i \le k) \vee (j \le k)) \wedge (i \neq j)}$ for all $(i, j)$ can be used
to emphasize ranking at the top $k$ elements. Misranking of pairs with at least
one element ranked among the top $k$ is penalized by this function. This can be
of interest in applications such as information extraction or search engines where
the ranking of the top documents matters more. For this emphasis function, all
elements ranked below $k$ are in a tie. Any tie relation can be encoded using $\omega$.
Finally, in a bipartite ranking scenario with $m^+$ positive and $m^-$ negative points
and $m^+ + m^- = n$, choosing $\omega(i, j) = \frac{n(n-1)}{2 m^- m^+}$ yields the standard loss function
coinciding with $1 - \mathrm{AUC}$.
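
As an illustration of how the choice of $\omega$ changes the loss, the weighted loss (9.35) can be sketched with two of the weight functions mentioned above. The dictionary representation of rankings and the names `weighted_ranking_loss`, `omega_uniform` and `omega_top_k` are assumptions made for this sketch.

```python
from itertools import permutations


def weighted_ranking_loss(sigma, sigma_star, omega):
    """Weighted pairwise disagreement (9.35); rankings are dicts
    mapping each element to its position (assumed representation)."""
    items = list(sigma)
    n = len(items)
    total = sum(
        omega(sigma_star[v], sigma_star[u])
        for u, v in permutations(items, 2)
        if sigma[u] < sigma[v] and sigma_star[v] < sigma_star[u]
    )
    return 2.0 * total / (n * (n - 1))


def omega_uniform(i, j):
    """Unweighted pairwise misranking: weight 1 for all i != j."""
    return 1.0 if i != j else 0.0


def omega_top_k(k):
    """Emphasis on the top k positions: weight 1 when at least one of
    the two positions is among the top k, and 0 otherwise."""
    return lambda i, j: 1.0 if (i <= k or j <= k) and i != j else 0.0
```

For instance, swapping the two lowest-ranked of four elements costs 1/6 under the uniform weight but nothing under the top-1 emphasis, while swapping the two highest-ranked elements costs 1/6 under both.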


9.7 Discussion 


The objective functions for the ranking problems discussed in this chapter were
all based on pairwise misranking. Other ranking criteria have been introduced in
information retrieval and used to derive alternative ranking algorithms. Here, we
briefly present several of these criteria.


■ Precision, precision@n, average precision, recall. All of these criteria assume that
points are partitioned into two classes (positives and negatives), as in the
bipartite ranking setting. Precision is the fraction of positively predicted points that
are in fact positive. Whereas precision takes into account all positive predictions,
precision@n only considers the top $n$ predictions. For example, precision@5
considers only the top 5 positively predicted points. Average precision involves computing
precision@n for each value of $n$, and averaging across these values. Each precision@n
computation can be interpreted as computing precision for a fixed value of recall,
or the fraction of positive points that are predicted to be positive (recall coincides
with the notion of true positive rate).


■ DCG, NDCG. These criteria assume the existence of relevance scores associated
with the points to be ranked, e.g., given a web search query, each website returned
by a search engine has an associated relevance score. Moreover, these criteria
measure the extent to which points with large relevance scores appear at or near the
beginning of a ranking. Define $(c_i)_{i \in \mathbb{N}}$ as a predefined sequence of non-increasing
and non-negative discount factors, e.g., $c_i = \log(i)^{-1}$. Then, given a ranking of $m$
points and defining $r_i$ as the relevance score of the $i$th point in this ranking, the
discounted cumulative gain (DCG) is defined as $\mathrm{DCG} = \sum_{i=1}^{m} c_i r_i$. Note that DCG
is an increasing function of $m$. In contrast, the normalized discounted cumulative
gain (NDCG) normalizes the DCG across values of $m$ by dividing the DCG by the
IDCG, or the ideal DCG that would result from an optimal ordering of the points.
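
A short sketch of these criteria may help fix ideas. It is illustrative only: the function names are hypothetical, inputs are listed in ranked order (best first), and the default discount $c_i = 1/\log_2(1 + i)$ is an assumption chosen here so that $c_1$ is finite. Any non-increasing, non-negative sequence can be passed instead.

```python
import math


def precision_at_n(labels, n):
    """Fraction of positives among the top n predictions.
    labels: 0/1 labels of the points in ranked order."""
    return sum(labels[:n]) / n


def dcg(relevances, discount=lambda i: 1.0 / math.log2(1 + i)):
    """Discounted cumulative gain: sum over i = 1..m of c_i * r_i.
    The default discount is an assumption of this sketch."""
    return sum(discount(i) * r for i, r in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the ideal DCG of the best possible ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that lists relevance scores in decreasing order has an NDCG of 1, and any other ordering of distinct scores gives a strictly smaller value.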


9.8 Chapter notes 


The problem of learning to rank is distinct from the purely algorithmic one of rank 
aggregation, which, as shown by Dwork, Kumar, Naor, and Sivakumar [2001], is 
NP-hard even for k = 4 rankings. The Rademacher complexity and margin-based 
generalization bounds for pairwise ranking given in theorem 9.1 and corollary 5.1 are 
novel. Margin bounds based on covering numbers were also given by Rudin, Cortes, 
Mohri, and Schapire [2005]. Other learning bounds in the score-based setting of 
ranking, including VC-dimension and stability-based learning bounds, have been 
given by Agarwal and Niyogi [2005], Agarwal et al. [2005] and Cortes et al. [2007b]. 

The ranking algorithm based on SVMs presented in section 9.3 has been used and 
discussed by several researchers. One early and specific discussion of its use can be 
found in Joachims [2002]. The fact that the algorithm is simply a special instance of 
SVMs seems not to be clearly stated in the literature. The theoretical justification 
presented here for its use in ranking is novel. 

RankBoost was introduced by Freund et al. [2003]. The version of the algorithm 
presented here is the coordinate descent RankBoost from Rudin et al. [2005]. 
RankBoost in general does not achieve a maximum margin and may not increase 
the margin at each iteration. A Smooth Margin ranking algorithm [Rudin et al., 
2005] based on a modified version of the objective function of RankBoost can be 
shown to increase the smooth margin at every iteration, but the comparison of 
its empirical performance with that of RankBoost has not been reported. For the 
empirical ranking quality of AdaBoost and the connections between AdaBoost and 
RankBoost in the bipartite setting, see Cortes and Mohri [2003] and Rudin et al.
[2005]. 

The Receiver Operating Characteristics (ROC) curves were originally developed 
in signal detection theory [Egan, 1975] in connection with radio signals during 
World War II. They also had applications to psychophysics [Green and Swets, 1966] 
and have been used since then in a variety of other applications, in particular for 
medical decision making. The area under an ROC curve (AUC) is equivalent to the 
Wilcoxon-Mann-Whitney statistic [Hanley and McNeil, 1982] and is closely related 
to the Gini index [Breiman et al., 1984] (see also chapter 8). For a statistical analysis 
of the AUC and confidence intervals depending on the error rate, see Cortes and 
Mohri [2003, 2005]. The deterministic algorithm in the preference-based setting
discussed in this chapter was presented and analyzed by Balcan et al. [2008]. The
randomized algorithm as well as many of the results presented in section 9.6 are
due to Ailon and Mohri [2008].

The somewhat related problem of ordinal regression, studied by several
authors [McCullagh, 1980, McCullagh and Nelder, 1983, Herbrich et al., 2000],
consists of predicting the correct label of each item out of a finite set, as in
multi-class classification, with the additional assumption of an ordering among the labels.
This problem is distinct, however, from the pairwise ranking problem discussed in
this chapter.

The DCG ranking criterion was introduced by Järvelin and Kekäläinen [2000],
and has been used and discussed in a number of subsequent studies, in particular
by Cossock and Zhang [2008], who consider a subset ranking problem formulated in
terms of DCG, for which they propose a regression-based solution.


9.9 Exercises 


9.1 Uniform margin-bound for ranking. Use theorem 9.1 to derive a margin-based 
learning bound for ranking that holds uniformly for all $\rho > 0$ (see similar binary
classification bounds of theorem 4.5 and exercise 4.2). 


9.2 On-line ranking. Give an on-line version of the SVM-based ranking algorithm 
presented in section 9.3. 


9.3 Empirical margin loss of RankBoost. Derive an upper bound on the empirical 
pairwise ranking margin loss of RankBoost similar to that of theorem 6.3 for 
AdaBoost. 


9.4 Margin maximization and RankBoost. Give an example showing that Rank- 
Boost does not achieve the maximum margin, as in the case of AdaBoost. 


9.5 RankPerceptron. Adapt the Perceptron algorithm to derive a pairwise ranking 
algorithm based on a linear scoring function. Assume that the training sample is 
linearly separable for pairwise ranking. Give an upper bound on the number of updates
made by the algorithm in terms of the ranking margin. 


9.6 Margin-maximization ranking. Give a linear programming (LP) algorithm re- 
turning a linear hypothesis for pairwise ranking based on margin maximization. 


9.7 Bipartite ranking. Suppose that we use a binary classifier for ranking in the
bipartite setting. Prove that if the error of the binary classifier is $\epsilon$, then that of the
ranking it induces is also at most $\epsilon$. Show that the converse does not hold.


9.8 Multipartite ranking. Consider the ranking scenario in a $k$-partite setting
where $\mathcal{X}$ is partitioned into $k$ subsets $\mathcal{X}_1, \ldots, \mathcal{X}_k$ with $k \ge 1$. The bipartite case
($k = 2$) is already specifically examined in the chapter. Give a precise formulation
of the problem in terms of $k$ distributions. Does RankBoost admit an efficient
implementation in this case? Give the pseudocode of the algorithm.


9.9 Deviation bound for the AUC. Let h be a fixed scoring function used to rank 
the points of $\mathcal{X}$. Use Hoeffding’s bound to show that with high probability the AUC
of h for a finite sample is close to its average. 


9.10 $k$-partite weight function. Show how the weight function $\omega$ can be defined so
that $L_\omega$ encodes the natural loss function associated to a $k$-partite ranking scenario.


10 Regression 


This chapter discusses in depth the learning problem of regression, which consists 
of using data to predict, as closely as possible, the correct real-valued labels of the 
points or items considered. Regression is a common task in machine learning with a 
variety of applications, which justifies the specific chapter we reserve to its analysis. 

The learning guarantees presented in the previous sections focused largely on 
classification problems. Here we present generalization bounds for regression, both 
for finite and infinite hypothesis sets. Several of these learning bounds are based on 
the familiar notion of Rademacher complexity, which is useful for characterizing 
the complexity of hypothesis sets in regression as well. Others are based on a 
combinatorial notion of complexity tailored to regression that we will introduce, 
pseudo-dimension, which can be viewed as an extension of the VC-dimension to 
regression. We describe a general technique for reducing regression problems to 
classification and deriving generalization bounds based on the notion of pseudo- 
dimension. We present and analyze several regression algorithms, including linear 
regression, kernel ridge regression, support-vector regression, Lasso, and several 
on-line versions of these algorithms. We discuss in detail the properties of these 
algorithms, including the corresponding learning guarantees. 


10.1 The problem of regression 


We first introduce the learning problem of regression. Let $\mathcal{X}$ denote the input space
and $\mathcal{Y}$ a measurable subset of $\mathbb{R}$. We denote by $D$ an unknown distribution over $\mathcal{X}$
according to which input points are drawn and by $f: \mathcal{X} \to \mathcal{Y}$ the target labeling
function. This corresponds to a deterministic learning scenario that we adopt to
simplify the presentation. As discussed in section 2.4.1, the deterministic scenario
can be straightforwardly extended to a stochastic one where we have instead a
distribution over the pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

As in all supervised learning problems, the learner receives a labeled sample
$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ with $x_1, \ldots, x_m$ drawn i.i.d. according
to $D$, and $y_i = f(x_i)$ for all $i \in [1, m]$. Since the labels are real numbers, it is
not reasonable to hope that the learner could predict precisely the correct label.
Instead, we can require that its predictions be close to the correct ones. This is
the key difference between regression and classification: in regression, the measure
of error is based on the magnitude of the difference between the real-valued label
predicted and the true or correct one, and not based on the equality or inequality
of these two values.

We denote by $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ the loss function used to measure the magnitude
of error. The most common loss function used in regression is the squared loss $L_2$
defined by $L(y, y') = |y' - y|^2$ for all $y, y' \in \mathcal{Y}$, or, more generally, an $L_p$ loss defined
by $L(y, y') = |y' - y|^p$, for some $p \ge 1$ and all $y, y' \in \mathcal{Y}$.

Given a hypothesis set $H$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the regression problem
consists of using the labeled sample $S$ to find a hypothesis $h \in H$ with small
expected loss or generalization error $R(h)$ with respect to the target $f$:

$$R(h) = E_{x \sim D}\big[L(h(x), f(x))\big]. \tag{10.1}$$

As in the previous chapters, the empirical loss or error of $h \in H$ is denoted by $\hat{R}(h)$
and defined by

$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i). \tag{10.2}$$

In the common case where $L$ is the squared loss, this represents the mean squared
error of $h$ on the sample $S$.
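
The empirical error (10.2) for an $L_p$ loss can be computed directly, as in the following sketch. The names `lp_loss` and `empirical_error` and the representation of the sample as a list of `(x, y)` pairs are assumptions made for illustration.

```python
def lp_loss(y_pred, y_true, p=2):
    """L_p loss |y' - y|^p; p = 2 gives the squared loss."""
    return abs(y_pred - y_true) ** p


def empirical_error(h, sample, p=2):
    """Average loss of hypothesis h over a labeled sample,
    given as a list of (x, y) pairs (assumed representation)."""
    return sum(lp_loss(h(x), y, p) for x, y in sample) / len(sample)
```

With `p=2` this is the mean squared error of `h` on the sample.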

When the loss function $L$ is bounded by some $M > 0$, that is $L(y', y) \le M$ for all
$y, y' \in \mathcal{Y}$ or, more strictly, $L(h(x), f(x)) \le M$ for all $h \in H$ and $x \in \mathcal{X}$, the problem
is referred to as a bounded regression problem. Many of the theoretical results
presented in the following sections are based on that assumption. The analysis of
unbounded regression problems is technically more elaborate and typically requires
some other types of assumptions.


10.2 Generalization bounds 


This section presents learning guarantees for bounded regression problems. We start 
with the simple case of a finite hypothesis set. 


10.2.1 Finite hypothesis sets 


In the case of a finite hypothesis set, we can derive a generalization bound for regression
by a straightforward application of Hoeffding’s inequality and the union bound.

Theorem 10.1
Let $L$ be a bounded loss function. Assume that the hypothesis set $H$ is finite. Then,
for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds for all
$h \in H$:

$$R(h) \le \hat{R}(h) + M \sqrt{\frac{\log |H| + \log \frac{1}{\delta}}{2m}}.$$


Proof By Hoeffding’s inequality, since $L$ takes values in $[0, M]$, for any $h \in H$,
the following holds:

$$\Pr\big[R(h) - \hat{R}(h) > \epsilon\big] \le e^{-\frac{2m\epsilon^2}{M^2}}.$$

Thus, by the union bound, we can write

$$\Pr\big[\exists h \in H : R(h) - \hat{R}(h) > \epsilon\big] \le \sum_{h \in H} \Pr\big[R(h) - \hat{R}(h) > \epsilon\big] \le |H| \, e^{-\frac{2m\epsilon^2}{M^2}}.$$

Setting the right-hand side to be equal to $\delta$ yields the statement of the theorem. ∎


With the same assumptions and using the same proof, a two-sided bound can be
derived: with probability at least $1 - \delta$, for all $h \in H$,

$$|R(h) - \hat{R}(h)| \le M \sqrt{\frac{\log |H| + \log \frac{2}{\delta}}{2m}}.$$


These learning bounds are similar to those derived for classification. In fact, they 
coincide with the classification bounds given in the inconsistent case when M = 1. 
Thus, all the remarks made in that context apply identically here. In particular, 
a larger sample size m guarantees better generalization; the bound increases as a 
function of log|H| and suggests selecting, for the same empirical error, a smaller 
hypothesis set. This is an instance of Occam’s razor principle for regression. In the 
next sections, we present other instances of this principle for the general case of 
infinite hypothesis sets using the notions of Rademacher complexity and pseudo- 
dimension. 
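
To get a feel for the magnitude of the finite-hypothesis-set bounds above, the deviation term can be evaluated numerically. The sketch below is only illustrative, and the function name `finite_class_bound` and its parameters are hypothetical.

```python
import math


def finite_class_bound(M, H_size, m, delta, two_sided=False):
    """Deviation term M * sqrt((log|H| + log(c / delta)) / (2m)),
    with c = 1 for the one-sided bound and c = 2 for the two-sided one."""
    confidence = math.log((2.0 if two_sided else 1.0) / delta)
    return M * math.sqrt((math.log(H_size) + confidence) / (2.0 * m))
```

For example, with $M = 1$, $|H| = 1000$, $m = 10000$ and $\delta = 0.05$, the one-sided term is roughly 0.022. It shrinks as $m$ grows and grows only logarithmically with $|H|$.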


10.2.2 Rademacher complexity bounds 


Here, we show how the Rademacher complexity bounds of theorem 3.1 can be
used to derive generalization bounds for regression in the case of the family of $L_p$
loss functions. We first show an upper bound for the Rademacher complexity of a
relevant family of functions.

Theorem 10.2 Rademacher complexity of $L_p$ loss functions

Let $p \ge 1$ and $H_p = \{x \mapsto |h(x) - f(x)|^p : h \in H\}$. Assume that $|h(x) - f(x)| \le M$
for all $x \in \mathcal{X}$ and $h \in H$. Then, for any sample $S$ of size $m$, the following inequality
holds:


$$\hat{\mathfrak{R}}_S(H_p) \le p M^{p-1} \, \hat{\mathfrak{R}}_S(H).$$


Proof Let $\phi_p: x \mapsto |x|^p$; then $H_p$ can be rewritten as $H_p = \{\phi_p \circ h : h \in H'\}$,
where $H' = \{x \mapsto h(x) - f(x) : h \in H\}$. Since $\phi_p$ is $p M^{p-1}$-Lipschitz over $[-M, M]$,
we can apply Talagrand’s lemma (lemma 4.2):


$$\hat{\mathfrak{R}}_S(H_p) \le p M^{p-1} \, \hat{\mathfrak{R}}_S(H'). \tag{10.3}$$


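Now, $\hat{\mathfrak{R}}_S(H')$ can be expressed as follows:

$$\hat{\mathfrak{R}}_S(H') = \frac{1}{m} E_\sigma\Big[\sup_{h \in H} \sum_{i=1}^{m} \big(\sigma_i h(x_i) - \sigma_i f(x_i)\big)\Big] = \frac{1}{m} E_\sigma\Big[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i)\Big] - \frac{1}{m} E_\sigma\Big[\sum_{i=1}^{m} \sigma_i f(x_i)\Big] = \hat{\mathfrak{R}}_S(H),$$

since $E_\sigma\big[\sum_{i=1}^{m} \sigma_i f(x_i)\big] = \sum_{i=1}^{m} E_\sigma[\sigma_i] f(x_i) = 0$.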
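
For a small finite family of functions, the inequality of theorem 10.2 can be checked exactly by enumerating all sign vectors. The sketch below is a numerical sanity check under assumptions, not part of the proof: each function is represented by its list of values on the sample, and the name `empirical_rademacher` and the toy values are hypothetical.

```python
from itertools import product


def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite family.

    values: list of rows, one per function, each row holding the values
    of that function on the m sample points. All 2^m sign vectors are
    enumerated, so this is only practical for small m.
    """
    m = len(values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * g for s, g in zip(sigma, row)) for row in values)
    return total / (2 ** m) / m


# Toy values of three hypotheses and of the target on three points.
H_vals = [[0.0, 0.5, 1.0], [1.0, 0.2, -0.3], [-0.5, 0.4, 0.9]]
f_vals = [0.2, 0.1, 0.8]
p = 2

residuals = [[h - f for h, f in zip(row, f_vals)] for row in H_vals]
M = max(abs(r) for row in residuals for r in row)
Hp_vals = [[abs(r) ** p for r in row] for row in residuals]

lhs = empirical_rademacher(Hp_vals)                 # complexity of H_p
rhs = p * M ** (p - 1) * empirical_rademacher(H_vals)
```

On this example `lhs` does not exceed `rhs`, and the complexity of the shifted family built from `residuals` coincides with that of `H_vals`, as in the last step of the proof.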


Combining this result with the general Rademacher complexity learning bound 
of theorem 3.1 yields directly the following Rademacher complexity bounds for 
regression with L, losses. 


Theorem 10.3 Rademacher complexity regression bounds 

Let $p \ge 1$ and assume that $\|h - f\|_\infty \le M$ for all $h \in H$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$ over a sample $S$ of size $m$, each of the following inequalities
holds for all $h \in H$:

$$E_{x \sim D}\big[|h(x) - f(x)|^p\big] \le \frac{1}{m} \sum_{i=1}^{m} |h(x_i) - f(x_i)|^p + 2 p M^{p-1} \, \mathfrak{R}_m(H) + M^p \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$E_{x \sim D}\big[|h(x) - f(x)|^p\big] \le \frac{1}{m} \sum_{i=1}^{m} |h(x_i) - f(x_i)|^p + 2 p M^{p-1} \, \hat{\mathfrak{R}}_S(H) + 3 M^p \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

As in the case of classification, these generalization bounds suggest a trade-off
between reducing the empirical error, which may require more complex hypothesis
sets, and controlling the Rademacher complexity of $H$, which may increase the
empirical error. An important benefit of the last learning bound is that it is
data-dependent. This can lead to more accurate learning guarantees. The upper bounds
on $\mathfrak{R}_m(H)$ or $\hat{\mathfrak{R}}_S(H)$ for kernel-based hypotheses (theorem 5.5) can be used directly
here to derive generalization bounds in terms of the trace of the kernel matrix or
the maximum diagonal entry.


Figure 10.1 Illustration of the shattering of a set of two points $\{x_1, x_2\}$ with
witnesses $t_1$ and $t_2$.


10.2.3. Pseudo-dimension bounds 


As previously discussed in the case of classification, it is sometimes computationally 
hard to estimate the empirical Rademacher complexity of a hypothesis set. In 
chapter 3, we introduce other measures of the complexity of a hypothesis set 
such as the VC-dimension, which are purely combinatorial and typically easier 
to compute or upper bound. However, the notion of shattering or that of VC- 
dimension introduced for binary classification are not readily applicable to real- 
valued hypothesis classes. 

We first introduce a new notion of shattering for families of real-valued functions.
As in previous chapters, we will use the notation $G$ for a family of functions,
whenever we intend to later interpret it (at least in some cases) as the family of loss
functions associated to some hypothesis set $H$: $G = \{x \mapsto L(h(x), f(x)) : h \in H\}$.


Definition 10.1 Shattering 
Let $G$ be a family of functions from $\mathcal{X}$ to $\mathbb{R}$. A set $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is said to be
shattered by $G$ if there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that

$$\left| \left\{ \begin{bmatrix} \mathrm{sgn}(g(x_1) - t_1) \\ \vdots \\ \mathrm{sgn}(g(x_m) - t_m) \end{bmatrix} : g \in G \right\} \right| = 2^m.$$

When they exist, the threshold values $t_1, \ldots, t_m$ are said to witness the shattering.
Thus, $\{x_1, \ldots, x_m\}$ is shattered if for some witnesses $t_1, \ldots, t_m$, the family of
functions $G$ is rich enough to contain a function going above a subset $A$ of the
set of points $I = \{(x_i, t_i) : i \in [1, m]\}$ and below the others ($I - A$), for any choice
of the subset $A$. Figure 10.1 illustrates this shattering in a simple case. The notion
of shattering naturally leads to the following definition.


Figure 10.2 A function $g: x \mapsto L(h(x), f(x))$ (in blue) defined as the loss of some
fixed hypothesis $h \in H$, and its thresholded version $x \mapsto 1_{L(h(x), f(x)) > t}$ (in red) with
respect to the threshold $t$ (in yellow).


Definition 10.2 Pseudo-dimension 
Let $G$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Then, the pseudo-dimension
of $G$, denoted by $\mathrm{Pdim}(G)$, is the size of the largest set shattered by $G$.
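
For a finite collection of functions and given witnesses, the shattering condition can be checked by counting the distinct above/below patterns. The sketch below is an illustration under assumptions: the name `is_shattered` is hypothetical, and "above" is taken to mean the strict inequality $g(x_i) > t_i$.

```python
def is_shattered(functions, points, thresholds):
    """Check whether the given functions realize all 2^m above/below
    patterns on the points with respect to the witness thresholds."""
    patterns = {
        tuple(g(x) > t for x, t in zip(points, thresholds))
        for g in functions
    }
    return len(patterns) == 2 ** len(points)


# Four affine functions on the real line (illustrative choice).
def const_pos(x): return 1.0
def const_neg(x): return -1.0
def decreasing(x): return 1.0 - 2.0 * x
def increasing(x): return -1.0 + 2.0 * x

lines = [const_pos, const_neg, decreasing, increasing]
```

Here the four affine functions shatter the points 0 and 1 with witnesses 0 and 0, whereas the two constant functions alone realize only two of the four patterns.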


By definition of the shattering just introduced, the notion of pseudo-dimension of
a family of real-valued functions $G$ coincides with that of the VC-dimension of the
corresponding thresholded functions mapping $\mathcal{X}$ to $\{0, 1\}$:


$$\mathrm{Pdim}(G) = \mathrm{VCdim}\big(\{(x, t) \mapsto 1_{(g(x) - t) > 0} : g \in G\}\big). \tag{10.4}$$


Figure 10.2 illustrates this interpretation. In view of this interpretation, the following
two results follow directly from the properties of the VC-dimension.


Theorem 10.4 
The pseudo-dimension of hyperplanes in $\mathbb{R}^N$ is given by


$$\mathrm{Pdim}\big(\{x \mapsto w \cdot x + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}\big) = N + 1.$$


Theorem 10.5 
The pseudo-dimension of a vector space of real-valued functions H is equal to the 
dimension of the vector space: 


Pdim(H) = dim(H). 


The following theorem gives a generalization bound for bounded regression
in terms of the pseudo-dimension of a family of loss functions
$G = \{x \mapsto L(h(x), f(x)) : h \in H\}$ associated to a hypothesis set $H$. The key technique to
derive these bounds consists of reducing the problem to that of classification by
making use of the following general identity for the expectation of a random variable
$X$:


$$E[X] = -\int_{-\infty}^{0} \Pr[X < t] \, dt + \int_{0}^{+\infty} \Pr[X > t] \, dt, \tag{10.5}$$


which holds by definition of the Lebesgue integral. In particular, for any distribution
$D$ and any non-negative measurable function $f$, we can write


$$E_{x \sim D}[f(x)] = \int_{0}^{\infty} \Pr_{x \sim D}[f(x) > t] \, dt. \tag{10.6}$$

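
Identity (10.6) is easy to check for an empirical distribution, where the tail probability is a step function and the integral can be computed exactly. The following sketch assumes non-negative values, and the name `tail_integral` is hypothetical.

```python
def tail_integral(values):
    """Integral over t >= 0 of the fraction of values exceeding t,
    computed exactly for a list of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    total, previous = 0.0, 0.0
    for i, x in enumerate(xs):
        # For t in (previous, x), exactly n - i values exceed t.
        total += (x - previous) * (n - i) / n
        previous = x
    return total
```

For any list of non-negative numbers, this integral coincides with the arithmetic mean, which is (10.6) applied to the empirical distribution.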
Theorem 10.6 

Let $H$ be a family of real-valued functions and let $G = \{x \mapsto L(h(x), f(x)) : h \in H\}$
be the family of loss functions associated to $H$. Assume that $\mathrm{Pdim}(G) = d$ and that
the loss function $L$ is bounded by $M$. Then, for any $\delta > 0$, with probability at least
$1 - \delta$ over the choice of a sample of size $m$, the following inequality holds for all
$h \in H$:


$$R(h) \le \hat{R}(h) + M \sqrt{\frac{2 d \log \frac{em}{d}}{m}} + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}}. \tag{10.7}$$


Proof Let $S$ be a sample of size $m$ drawn i.i.d. according to $D$ and let $\hat{D}$ denote
the empirical distribution defined by $S$. For any $h \in H$ and $t \ge 0$, we denote by
$c(h, t)$ the classifier defined by $c(h, t): x \mapsto 1_{L(h(x), f(x)) > t}$. The error of $c(h, t)$ can
be defined by


$$R(c(h, t)) = \Pr_{x \sim D}\big[c(h, t)(x) = 1\big] = \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big],$$


and, similarly, its empirical error is $\hat{R}(c(h, t)) = \Pr_{x \sim \hat{D}}\big[L(h(x), f(x)) > t\big]$.


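Now, in view of the identity (10.6) and the fact that the loss function $L$ is bounded
by $M$, we can write:


$$\begin{aligned}
|R(h) - \hat{R}(h)| &= \Big| E_{x \sim D}\big[L(h(x), f(x))\big] - E_{x \sim \hat{D}}\big[L(h(x), f(x))\big] \Big| \\
&= \Big| \int_{0}^{M} \Big( \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x \sim \hat{D}}\big[L(h(x), f(x)) > t\big] \Big) dt \Big| \\
&\le M \sup_{t \in [0, M]} \Big| \Pr_{x \sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x \sim \hat{D}}\big[L(h(x), f(x)) > t\big] \Big| \\
&= M \sup_{t \in [0, M]} \big| R(c(h, t)) - \hat{R}(c(h, t)) \big|.
\end{aligned}$$


This implies the following inequality:


$$\Pr\Big[\sup_{h \in H} |R(h) - \hat{R}(h)| > \epsilon\Big] \le \Pr\Big[\sup_{h \in H,\, t \in [0, M]} \big|R(c(h, t)) - \hat{R}(c(h, t))\big| > \frac{\epsilon}{M}\Big].$$


The right-hand side can be bounded using a standard generalization bound for
classification (corollary 3.4) in terms of the VC-dimension of the family of hypotheses
$\{c(h, t) : h \in H, t \in [0, M]\}$, which, by definition of the pseudo-dimension, is
precisely $\mathrm{Pdim}(G) = d$. The resulting bound coincides with (10.7). ∎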


The notion of pseudo-dimension is suited to the analysis of regression as
demonstrated by the previous theorem; however, it is not a scale-sensitive notion. There
exists an alternative complexity measure, the fat-shattering dimension, that is
scale-sensitive and that can be viewed as a natural extension of the pseudo-dimension.
Its definition is based on the notion of $\gamma$-shattering.


Definition 10.3 $\gamma$-shattering

Let $G$ be a family of functions from $\mathcal{X}$ to $\mathbb{R}$ and let $\gamma > 0$. A set $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$
is said to be $\gamma$-shattered by $G$ if there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that for all
$y \in \{-1, +1\}^m$, there exists $g \in G$ such that:


$$\forall i \in [1, m], \quad y_i \big(g(x_i) - t_i\big) \ge \gamma.$$


Thus, $\{x_1, \ldots, x_m\}$ is $\gamma$-shattered if for some witnesses $t_1, \ldots, t_m$, the family of
functions $G$ is rich enough to contain a function going at least $\gamma$ above a subset $A$
of the set of points $I = \{(x_i, t_i) : i \in [1, m]\}$ and at least $\gamma$ below the others ($I - A$),
for any choice of the subset $A$.


Definition 10.4 $\gamma$-fat-dimension
The $\gamma$-fat-dimension of $G$, $\mathrm{fat}_\gamma(G)$, is the size of the largest set that is $\gamma$-shattered
by $G$.


Finer generalization bounds than those based on the pseudo-dimension can be
derived in terms of the $\gamma$-fat-dimension. However, the resulting learning bounds
are not more informative than those based on the Rademacher complexity, which
is also a scale-sensitive complexity measure. Thus, we will not detail an analysis
based on the $\gamma$-fat-dimension.


Figure 10.3 For $N = 1$, linear regression consists of finding the line of best fit,
measured in terms of the squared loss.


10.3. Regression algorithms 


The results of the previous sections show that, for the same empirical error, 
hypothesis sets with smaller complexity measured in terms of the Rademacher 
complexity or in terms of pseudo-dimension benefit from better generalization 
guarantees. One family of functions with relatively small complexity is that of linear 
hypotheses. In this section, we describe and analyze several algorithms based on 
that hypothesis set: linear regression, kernel ridge regression (KRR), support vector 
regression (SVR), and Lasso. These algorithms, in particular the last three, are 
extensively used in practice and often lead to state-of-the-art performance results. 


10.3.1 Linear regression 


We start with the simplest algorithm for regression known as linear regression. Let
$\Phi: \mathcal{X} \to \mathbb{R}^N$ be a feature mapping from the input space $\mathcal{X}$ to $\mathbb{R}^N$ and consider the
family of linear hypotheses


$$H = \{x \mapsto w \cdot \Phi(x) + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}. \tag{10.8}$$


Linear regression consists of seeking a hypothesis in $H$ with the smallest empirical
mean squared error. Thus, for a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$,
the following is the corresponding optimization problem:


$$\min_{w, b} \frac{1}{m} \sum_{i=1}^{m} \big(w \cdot \Phi(x_i) + b - y_i\big)^2. \tag{10.9}$$


Figure 10.3 illustrates the algorithm in the simple case where $N = 1$. The optimization
problem admits the simpler formulation:


$$\min_{W} F(W) = \frac{1}{m} \|X^\top W - Y\|^2, \tag{10.10}$$

using the notation $X = \begin{bmatrix} \Phi(x_1) & \cdots & \Phi(x_m) \\ 1 & \cdots & 1 \end{bmatrix}$, $W = \begin{bmatrix} w_1 \\ \vdots \\ w_N \\ b \end{bmatrix}$, and $Y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$. The objective
function $F$ is convex, by composition of the convex function $u \mapsto \|u\|^2$ with the
affine function $W \mapsto X^\top W - Y$, and it is differentiable. Thus, $F$ admits a global
minimum at $W$ if and only if $\nabla F(W) = 0$, that is if and only if


$$\frac{2}{m} X (X^\top W - Y) = 0 \Leftrightarrow X X^\top W = X Y. \tag{10.11}$$



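When $\mathbf{X}\mathbf{X}^\top$ is invertible, this equation admits a unique solution. Otherwise, the
equation admits a family of solutions that can be given in terms of the pseudo-inverse
of matrix $\mathbf{X}\mathbf{X}^\top$ (see appendix A) by $\mathbf{W} = (\mathbf{X}\mathbf{X}^\top)^\dagger \mathbf{X}\mathbf{Y} + (\mathbf{I} - (\mathbf{X}\mathbf{X}^\top)^\dagger (\mathbf{X}\mathbf{X}^\top))\mathbf{W}_0$,
where $\mathbf{W}_0$ is an arbitrary vector of the same dimension as $\mathbf{W}$. Among these, the solution
$\mathbf{W} = (\mathbf{X}\mathbf{X}^\top)^\dagger \mathbf{X}\mathbf{Y}$ is the one with the minimal norm and is often preferred for that reason.
Thus, we will write the solutions as


$$\mathbf{W} = \begin{cases} (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{Y} & \text{if } \mathbf{X}\mathbf{X}^\top \text{ is invertible,} \\ (\mathbf{X}\mathbf{X}^\top)^{\dagger}\mathbf{X}\mathbf{Y} & \text{otherwise.} \end{cases} \tag{10.12}$$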


The matrix $\mathbf{X}\mathbf{X}^\top$ can be computed in $O(mN^2)$. The cost of its inversion or that of
computing its pseudo-inverse is in $O(N^3)$.¹ Finally, the multiplication with $\mathbf{X}$ and
$\mathbf{Y}$ takes $O(mN^2)$. Therefore, the overall complexity of computing the solution $\mathbf{W}$
is in $O(mN^2 + N^3)$. Thus, when the dimension of the feature space $N$ is not too
large, the solution can be computed efficiently.
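
The closed-form solution (10.12) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `linear_regression`, the row-wise layout of the input `Phi`, and the `rcond` cutoff are assumptions made for the example. The pseudo-inverse covers both cases of (10.12), since it coincides with the inverse when $\mathbf{X}\mathbf{X}^\top$ is invertible.

```python
import numpy as np

def linear_regression(Phi, y, rcond=1e-10):
    """Least-squares fit of h(x) = w . Phi(x) + b via equation (10.12).

    Phi: (m, N) array whose i-th row is Phi(x_i) (layout chosen for this sketch).
    y:   (m,) array of targets.
    Returns (w, b); when X X^T is singular this is the minimal-norm solution.
    """
    m = Phi.shape[0]
    # X as in the text: feature vectors as columns, plus a row of ones for b.
    X = np.vstack([Phi.T, np.ones((1, m))])          # shape (N + 1, m)
    W = np.linalg.pinv(X @ X.T, rcond=rcond) @ (X @ y)
    return W[:-1], W[-1]

# Three points lying exactly on the line y = 2x + 1.
Phi = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
w, b = linear_regression(Phi, y)
```

Duplicating the single feature makes $\mathbf{X}\mathbf{X}^\top$ singular; the same call then returns the minimal-norm solution, which splits the slope evenly across the two identical features.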

While linear regression is simple and admits a straightforward implementation, 
it does not benefit from a strong generalization guarantee, since it is limited to 
minimizing the empirical error without controlling the norm of the weight vector 
and without any other regularization. Its performance is also typically poor in most 
applications. The next sections describe algorithms with both better theoretical 
guarantees and improved performance in practice. 


1. In the analysis of the computational complexity of the algorithms discussed in this
chapter, the cubic-time complexity of matrix inversion can be replaced by a more favorable
complexity $O(N^{2+\omega})$, with $\omega = .376$, using asymptotically faster matrix inversion methods
such as that of Coppersmith and Winograd.
10.3.2 Kernel ridge regression


We first present a learning guarantee for regression with bounded linear hypotheses 
in a feature space defined by a PDS kernel. This will provide a strong theoretical 
support for the kernel ridge regression algorithm presented in this section. The 
learning bounds of this section are given for the squared loss. Thus, in particular, 
the generalization error of a hypothesis $h$ is defined by $R(h) = \mathbb{E}_x[(h(x) - f(x))^2]$
when the target function is $f$.


Theorem 10.7 

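Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel, $\Phi\colon \mathcal{X} \to \mathbb{H}$ a feature mapping associated to
$K$, and $H = \{x \mapsto \mathbf{w} \cdot \Phi(x) : \|\mathbf{w}\|_{\mathbb{H}} \le \Lambda\}$. Assume that there exists $r > 0$ such that
$K(x, x) \le r^2$ and $|f(x)| \le \Lambda r$ for all $x \in \mathcal{X}$. Then, for any $\delta > 0$, with probability
at least $1 - \delta$, each of the following inequalities holds for all $h \in H$:


$$R(h) \le \widehat{R}(h) + \frac{8 r^2 \Lambda^2}{\sqrt{m}} \left(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\right) \tag{10.13}$$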


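$$R(h) \le \widehat{R}(h) + \frac{8 r^2 \Lambda^2}{\sqrt{m}} \left(\sqrt{\frac{\operatorname{Tr}[\mathbf{K}]}{m r^2}} + \frac{3}{2}\sqrt{\frac{\log\frac{2}{\delta}}{2}}\right). \tag{10.14}$$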


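Proof For all $x \in \mathcal{X}$, we have $|\mathbf{w} \cdot \Phi(x)| \le \Lambda \|\Phi(x)\| \le \Lambda r$, thus, for all $x \in \mathcal{X}$ and
$h \in H$, $|h(x) - f(x)| \le 2\Lambda r$. By the bound on the empirical Rademacher complexity
of kernel-based hypotheses (theorem 5.5), the following holds for any sample $S$ of
size $m$:


$$\widehat{\mathfrak{R}}_S(H) \le \frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{r^2 \Lambda^2}{m}},$$


which implies that $\mathfrak{R}_m(H) \le \sqrt{\frac{r^2 \Lambda^2}{m}}$. Plugging in this inequality in the first bound
of theorem 10.3 with $M = 2\Lambda r$ gives


$$R(h) \le \widehat{R}(h) + 4M \mathfrak{R}_m(H) + M^2 \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \le \widehat{R}(h) + \frac{8 r^2 \Lambda^2}{\sqrt{m}} \left(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\right).$$


The second generalization bound is shown in a similar way by using the second
bound of theorem 10.3. ∎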


The first bound of the theorem just presented has the form $R(h) \le \widehat{R}(h) + \lambda \Lambda^2$,
with $\lambda = \frac{8 r^2}{\sqrt{m}} \left(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\right) = O\!\left(\frac{1}{\sqrt{m}}\right)$. Kernel ridge regression is defined by the
minimization of an objective function that has precisely this form and thus is directly
motivated by the theoretical analysis just presented:


$$\min_{\mathbf{w}} F(\mathbf{w}) = \lambda \|\mathbf{w}\|^2 + \sum_{i=1}^{m} (\mathbf{w} \cdot \Phi(x_i) - y_i)^2. \tag{10.15}$$
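Here, $\lambda$ is a positive parameter determining the trade-off between the regularization
term $\|\mathbf{w}\|^2$ and the empirical mean squared error. The objective function differs from
that of linear regression only by the first term, which controls the norm of $\mathbf{w}$. As in
the case of linear regression, the problem can be rewritten in a more compact form
as


$$\min_{\mathbf{W}} F(\mathbf{W}) = \lambda \|\mathbf{W}\|^2 + \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|^2, \tag{10.16}$$


where $\mathbf{X} \in \mathbb{R}^{N \times m}$ is the matrix formed by the feature vectors, $\mathbf{X} = [\Phi(x_1) \ \ldots \ \Phi(x_m)]$,
$\mathbf{W} = \mathbf{w}$, and $\mathbf{Y} = (y_1, \ldots, y_m)^\top$. Here too, $F$ is convex, by the convexity of
$\mathbf{w} \mapsto \|\mathbf{w}\|^2$ and that of the sum of two convex functions, and is differentiable.
Thus $F$ admits a global minimum at $\mathbf{W}$ if and only if


$$\nabla F(\mathbf{W}) = 0 \iff (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})\mathbf{W} = \mathbf{X}\mathbf{Y} \iff \mathbf{W} = (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{Y}. \tag{10.17}$$


Note that the matrix $\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I}$ is always invertible, since its eigenvalues are the
sum of the non-negative eigenvalues of the symmetric positive semidefinite matrix
$\mathbf{X}\mathbf{X}^\top$ and $\lambda > 0$. Thus, kernel ridge regression admits a closed-form solution.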

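An alternative formulation of the optimization problem for kernel ridge regression
equivalent to (10.15) is


$$\min_{\mathbf{w}} \sum_{i=1}^{m} (\mathbf{w} \cdot \Phi(x_i) - y_i)^2 \quad \text{subject to: } \|\mathbf{w}\|^2 \le \Lambda^2.$$

This makes the connection with the bounded linear hypothesis set of theorem 10.7
even more evident. Using slack variables $\xi_i$, for all $i \in [1, m]$, the problem can be
equivalently written as

$$\min_{\mathbf{w}} \sum_{i=1}^{m} \xi_i^2 \quad \text{subject to: } (\|\mathbf{w}\|^2 \le \Lambda^2) \wedge (\forall i \in [1, m],\ \xi_i = y_i - \mathbf{w} \cdot \Phi(x_i)).$$

This is a convex optimization problem with differentiable objective function and
constraints. To derive the equivalent dual problem, we introduce the Lagrangian $\mathcal{L}$,
which is defined for all $\boldsymbol{\xi}$, $\mathbf{w}$, $\boldsymbol{\alpha}'$, and $\lambda \ge 0$ by


$$\mathcal{L}(\boldsymbol{\xi}, \mathbf{w}, \boldsymbol{\alpha}', \lambda) = \sum_{i=1}^{m} \xi_i^2 + \sum_{i=1}^{m} \alpha'_i (y_i - \xi_i - \mathbf{w} \cdot \Phi(x_i)) + \lambda(\|\mathbf{w}\|^2 - \Lambda^2).$$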
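The KKT conditions lead to the following equalities:


$$\nabla_{\mathbf{w}} \mathcal{L} = -\sum_{i=1}^{m} \alpha'_i \Phi(x_i) + 2\lambda \mathbf{w} = 0 \implies \mathbf{w} = \frac{1}{2\lambda} \sum_{i=1}^{m} \alpha'_i \Phi(x_i)$$
$$\nabla_{\xi_i} \mathcal{L} = 2\xi_i - \alpha'_i = 0 \implies \xi_i = \alpha'_i / 2$$
$$\forall i \in [1, m],\ \alpha'_i (y_i - \xi_i - \mathbf{w} \cdot \Phi(x_i)) = 0$$
$$\lambda(\|\mathbf{w}\|^2 - \Lambda^2) = 0.$$


Plugging in the expressions of $\mathbf{w}$ and the $\xi_i$s in that of $\mathcal{L}$ gives


$$\begin{aligned}
\mathcal{L} &= \sum_{i=1}^{m} \frac{\alpha_i'^2}{4} + \sum_{i=1}^{m} \alpha'_i y_i - \sum_{i=1}^{m} \frac{\alpha_i'^2}{2} - \frac{1}{2\lambda} \sum_{i,j=1}^{m} \alpha'_i \alpha'_j \Phi(x_i)^\top \Phi(x_j) + \lambda \Big( \frac{1}{4\lambda^2} \Big\| \sum_{i=1}^{m} \alpha'_i \Phi(x_i) \Big\|^2 - \Lambda^2 \Big) \\
&= -\frac{1}{4} \sum_{i=1}^{m} \alpha_i'^2 + \sum_{i=1}^{m} \alpha'_i y_i - \frac{1}{4\lambda} \sum_{i,j=1}^{m} \alpha'_i \alpha'_j \Phi(x_i)^\top \Phi(x_j) - \lambda \Lambda^2 \\
&= \lambda \Big( -\lambda \sum_{i=1}^{m} \alpha_i^2 + 2 \sum_{i=1}^{m} \alpha_i y_i - \sum_{i,j=1}^{m} \alpha_i \alpha_j \Phi(x_i)^\top \Phi(x_j) - \Lambda^2 \Big),
\end{aligned}$$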
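with $\alpha'_i = 2\lambda \alpha_i$. Thus, since the positive factor $\lambda$ and the constant term $\Lambda^2$ do not
affect the maximizer, the equivalent dual optimization problem for KRR can be
written as follows:


$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^m} \; -\lambda \boldsymbol{\alpha}^\top \boldsymbol{\alpha} + 2 \boldsymbol{\alpha}^\top \mathbf{Y} - \boldsymbol{\alpha}^\top (\mathbf{X}^\top \mathbf{X}) \boldsymbol{\alpha}, \tag{10.18}$$


or, more compactly, as


$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^m} G(\boldsymbol{\alpha}) = -\boldsymbol{\alpha}^\top (\mathbf{K} + \lambda \mathbf{I}) \boldsymbol{\alpha} + 2 \boldsymbol{\alpha}^\top \mathbf{Y}, \tag{10.19}$$


where $\mathbf{K} = \mathbf{X}^\top \mathbf{X}$ is the kernel matrix associated to the training sample. The
objective function $G$ is concave and differentiable. The optimal solution is obtained
by differentiating the function and setting it to zero:


$$\nabla G(\boldsymbol{\alpha}) = 0 \iff 2(\mathbf{K} + \lambda \mathbf{I}) \boldsymbol{\alpha} = 2\mathbf{Y} \iff \boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{Y}. \tag{10.20}$$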


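Note that $(\mathbf{K} + \lambda \mathbf{I})$ is invertible, since its eigenvalues are the sum of the eigenvalues
of the SPSD matrix $\mathbf{K}$ and $\lambda > 0$. Thus, as in the primal case, the dual optimization
problem admits a closed-form solution. By the first KKT equation, $\mathbf{w}$ can be
determined from $\boldsymbol{\alpha}$ by


$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i \Phi(x_i) = \mathbf{X} \boldsymbol{\alpha} = \mathbf{X}(\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{Y}. \tag{10.21}$$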
The hypothesis $h$ solution can be given as follows in terms of $\boldsymbol{\alpha}$:

$$\forall x \in \mathcal{X}, \quad h(x) = \mathbf{w} \cdot \Phi(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x). \tag{10.22}$$


Note that the form of the solution, $h = \sum_{i=1}^{m} \alpha_i K(x_i, \cdot)$, could be immediately
predicted using the Representer theorem, since the objective function minimized by
KRR falls within the general framework of theorem 5.4. This also could show that $\mathbf{w}$
could be written as $\mathbf{w} = \mathbf{X} \boldsymbol{\alpha}$. This fact, combined with the following simple lemma,
can be used to determine $\boldsymbol{\alpha}$ in a straightforward manner, without the intermediate
derivation of the dual problem.


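Lemma 10.1
The following identity holds for any matrix $\mathbf{X}$:


$$(\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{X} = \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}.$$


Proof Observe that $(\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})\mathbf{X} = \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})$. Left-multiplying this equality by
$(\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1}$ and right-multiplying it by $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}$ yields the statement
of the lemma. ∎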


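Now, using this lemma, the primal solution of $\mathbf{w}$ can be rewritten as follows:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})^{-1} \mathbf{X}\mathbf{Y} = \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{Y} = \mathbf{X}(\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{Y}.$$


Comparing with $\mathbf{w} = \mathbf{X} \boldsymbol{\alpha}$ gives immediately $\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{Y}$.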
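
The agreement between the primal solution (10.17) and the dual solution (10.20)-(10.21) can be checked numerically. The sketch below assumes a linear kernel, so that $\mathbf{K} = \mathbf{X}^\top \mathbf{X}$ can be formed explicitly; the function names `krr_primal` and `krr_dual` and the random data are illustrative choices. Linear systems are solved directly rather than by forming an explicit inverse.

```python
import numpy as np

def krr_primal(X, Y, lam):
    """Primal KRR solution w = (X X^T + lam I)^{-1} X Y, with X of shape (N, m)."""
    N = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(N), X @ Y)

def krr_dual(K, Y, lam):
    """Dual KRR solution alpha = (K + lam I)^{-1} Y, with K the (m, m) kernel matrix."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))          # N = 3 features, m = 8 points stored as columns
Y = rng.normal(size=8)
lam = 0.5

w = krr_primal(X, Y, lam)
alpha = krr_dual(X.T @ X, Y, lam)    # linear kernel: K = X^T X

# Prediction at a new point, once through w and once through kernel values (10.22).
x_new = rng.normal(size=3)
pred_primal = w @ x_new
pred_dual = alpha @ (X.T @ x_new)
```

Both routes give the same weight vector, $\mathbf{w} = \mathbf{X}\boldsymbol{\alpha}$, and hence the same prediction, as lemma 10.1 guarantees.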

Our presentation of the KRR algorithm was given for linear hypotheses with no
offset, that is we implicitly assumed $b = 0$. It is common to use this formulation
and to extend it to the general case by augmenting the feature vector $\Phi(x)$ with an
extra component equal to one for all $x \in \mathcal{X}$ and the weight vector $\mathbf{w}$ with an extra
component $b \in \mathbb{R}$. For the augmented feature vector $\Phi'(x) \in \mathbb{R}^{N+1}$ and weight
vector $\mathbf{w}' \in \mathbb{R}^{N+1}$, we have $\mathbf{w}' \cdot \Phi'(x) = \mathbf{w} \cdot \Phi(x) + b$. Nevertheless, this formulation
does not coincide with the general KRR algorithm where a solution of the form
$x \mapsto \mathbf{w} \cdot \Phi(x) + b$ is sought. This is because for the general KRR, the regularization
term is $\lambda \|\mathbf{w}\|^2$, while for the extension just described it is $\lambda \|\mathbf{w}'\|^2$, which also penalizes $b$.

In both the primal and dual cases, KRR admits a closed-form solution. Table 10.1
gives the time complexity of the algorithm for computing the solution and the one
for determining the prediction value of a point in both cases. In the primal case,
determining the solution $\mathbf{w}$ requires computing matrix $\mathbf{X}\mathbf{X}^\top$, which takes $O(mN^2)$,
the inversion of $(\mathbf{X}\mathbf{X}^\top + \lambda \mathbf{I})$, which is in $O(N^3)$, and multiplication with $\mathbf{X}$, which
is in $O(mN^2)$. Prediction requires computing the inner product of $\mathbf{w}$ with a feature
vector of the same dimension that can be achieved in $O(N)$. The dual solution first
requires computing the kernel matrix $\mathbf{K}$. Let $\kappa$ be the maximum cost of computing
$K(x, x')$ for all pairs $(x, x') \in \mathcal{X} \times \mathcal{X}$. Then, $\mathbf{K}$ can be computed in $O(\kappa m^2)$. The
inversion of matrix $\mathbf{K} + \lambda \mathbf{I}$ can be achieved in $O(m^3)$ and multiplication with $\mathbf{Y}$
takes $O(m^2)$. Prediction requires computing the vector $(K(x_1, x), \ldots, K(x_m, x))^\top$
for some $x \in \mathcal{X}$, which requires $O(\kappa m)$, and the inner product with $\boldsymbol{\alpha}$, which is in
$O(m)$.


       | Solution              | Prediction
Primal | $O(mN^2 + N^3)$       | $O(N)$
Dual   | $O(\kappa m^2 + m^3)$ | $O(\kappa m)$


Table 10.1 Comparison of the running-time complexity of KRR for computing
the solution or the prediction value of a point in both the primal and the dual
case. $\kappa$ denotes the time complexity of computing a kernel value; for polynomial
and Gaussian kernels, $\kappa = O(N)$.


Thus, in both cases, the main step for computing the solution is a matrix inversion,
which takes $O(N^3)$ in the primal case and $O(m^3)$ in the dual case. When the dimension
of the feature space is relatively small, solving the primal problem is advantageous,
while for high-dimensional spaces and medium-sized training sets, solving the dual
is preferable. Note that for relatively large matrices, the space complexity could also
be an issue: the size of relatively large matrices could be prohibitive for memory
storage and the use of external memory could significantly affect the running time
of the algorithm.

For sparse matrices, there exist several techniques for faster computations of the 
matrix inversion. This can be useful in the primal case where the features can be 
relatively sparse. On the other hand, the kernel matrix K is typically dense; thus, 
there is less hope for benefiting from such techniques in the dual case. In such cases, 
or, more generally, to deal with the time and space complexity issues arising when
$m$ and $N$ are large, approximation methods using low-rank approximations via the
Nyström method or the partial Cholesky decomposition can be used very effectively.
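
The text only names the Nyström method; as a rough illustration, the standard construction approximates $\mathbf{K}$ by $\mathbf{C}\mathbf{W}^\dagger\mathbf{C}^\top$, where $\mathbf{C}$ holds a subset of the columns of $\mathbf{K}$ and $\mathbf{W}$ is the block of $\mathbf{K}$ at the intersection of the selected rows and columns. In the sketch below, the function name `nystrom_approximation`, the `rcond` cutoff, and the choice of columns (left to the caller, with no sampling scheme implied) are assumptions of the example.

```python
import numpy as np

def nystrom_approximation(K, idx, rcond=1e-10):
    """Low-rank approximation K ~ C W^+ C^T built from the columns listed in idx."""
    C = K[:, idx]                      # (m, l) selected columns of K
    W = K[np.ix_(idx, idx)]            # (l, l) block at the selected rows and columns
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T

# A rank-3 kernel matrix on m = 6 points (linear kernel on 3-dimensional features).
Z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
K = Z @ Z.T
K_approx = nystrom_approximation(K, [0, 1, 2])
```

Here the three selected columns already span the range of $\mathbf{K}$, so the approximation is exact; in general, only $l$ columns of $\mathbf{K}$ need to be computed and stored, and the quality of the approximation depends on which columns are selected.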

The KRR algorithm admits several advantages: it benefits from favorable theoretical
guarantees since it can be derived directly from the generalization bound we
presented; it admits a closed-form solution, which can make the analysis of many
of its properties convenient; and it can be used with PDS kernels, which extends its
use to non-linear regression solutions and more general feature spaces. KRR also
admits favorable stability properties that we discuss in chapter 11.

Figure 10.4 SVR attempts to fit a "tube" with width $\epsilon$ to the data. Training data
within the "epsilon tube" (blue points) incur no loss.


The algorithm can be generalized to learning a mapping from $\mathcal{X}$ to $\mathbb{R}^p$, $p > 1$.
This can be done by formulating the problem as $p$ independent regression problems,
each consisting of predicting one of the $p$ target components. Remarkably, the
computation of the solution for this generalized algorithm requires only a single
matrix inversion, e.g., $(\mathbf{K} + \lambda \mathbf{I})^{-1}$ in the dual case, regardless of the value of $p$.
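
The multi-output case can be sketched as follows: stacking the $p$ target vectors as the columns of an $m \times p$ matrix lets a single factorization of $\mathbf{K} + \lambda \mathbf{I}$ serve all $p$ problems. The function name `krr_dual_multi` and the small data set are assumptions made for the example.

```python
import numpy as np

def krr_dual_multi(K, Y, lam):
    """Dual KRR coefficients for p targets at once.

    K: (m, m) kernel matrix; Y: (m, p) matrix whose columns are the p target vectors.
    Returns an (m, p) matrix whose j-th column solves the j-th regression problem.
    """
    m = K.shape[0]
    # One linear solve with a matrix right-hand side: K + lam I is factored once.
    return np.linalg.solve(K + lam * np.eye(m), Y)

Z = np.arange(12.0).reshape(4, 3)        # 4 points with 3 features (rows)
K = Z @ Z.T                              # linear kernel matrix
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, -1.0],
              [0.5, 3.0]])               # p = 2 targets per point
A = krr_dual_multi(K, Y, lam=1.0)
```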

One drawback of the KRR algorithm, in addition to the computational issues for 
determining the solution for relatively large matrices, is the fact that the solution it 
returns is typically not sparse. The next two sections present two sparse algorithms 
for linear regression. 


10.3.3 Support vector regression 


In this section, we present the support vector regression (SVR) algorithm, which
is inspired by the SVM algorithm presented for classification in chapter 4. The
main idea of the algorithm consists of fitting a tube of width $\epsilon > 0$ to the data, as
illustrated by figure 10.4. As in binary classification, this defines two sets of points:
those falling inside the tube, which are $\epsilon$-close to the function predicted and thus
not penalized, and those falling outside, which are penalized based on their distance
to the predicted function, in a way that is similar to the penalization used by SVMs
in classification.

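Using a hypothesis set $H$ of linear functions, $H = \{x \mapsto \mathbf{w} \cdot \Phi(x) + b : \mathbf{w} \in \mathbb{R}^N, b \in \mathbb{R}\}$,
where $\Phi$ is the feature mapping corresponding to some PDS kernel $K$,
the optimization problem for SVR can be written as follows:


$$\min_{\mathbf{w}, b} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \big| y_i - (\mathbf{w} \cdot \Phi(x_i) + b) \big|_\epsilon, \tag{10.23}$$

where $|\cdot|_\epsilon$ denotes the $\epsilon$-insensitive loss:

$$\forall y, y' \in \mathcal{Y}, \quad |y' - y|_\epsilon = \max(0, |y' - y| - \epsilon). \tag{10.24}$$


The use of this loss function leads to sparse solutions with a relatively small
number of support vectors. Using slack variables $\xi_i \ge 0$ and $\xi'_i \ge 0$, $i \in [1, m]$,
the optimization problem can be equivalently written as

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\xi}'} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi'_i) \tag{10.25}$$

subject to
$$(\mathbf{w} \cdot \Phi(x_i) + b) - y_i \le \epsilon + \xi_i$$
$$y_i - (\mathbf{w} \cdot \Phi(x_i) + b) \le \epsilon + \xi'_i$$
$$\xi_i \ge 0, \ \xi'_i \ge 0, \quad \forall i \in [1, m].$$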

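This is a convex quadratic program (QP) with affine constraints. Introducing the
Lagrangian and applying the KKT conditions leads to the following equivalent dual
problem in terms of the kernel matrix $\mathbf{K}$:


$$\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}'} \; -\epsilon (\boldsymbol{\alpha}' + \boldsymbol{\alpha})^\top \mathbf{1} + (\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \mathbf{y} - \frac{1}{2} (\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \mathbf{K} (\boldsymbol{\alpha}' - \boldsymbol{\alpha}) \tag{10.26}$$

$$\text{subject to: } (\mathbf{0} \le \boldsymbol{\alpha} \le \mathbf{C}) \wedge (\mathbf{0} \le \boldsymbol{\alpha}' \le \mathbf{C}) \wedge ((\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \mathbf{1} = 0).$$


Any PDS kernel $K$ can be used with SVR, which extends the algorithm to non-linear
regression solutions. Problem (10.26) is a convex QP similar to the dual problem
of SVMs and can be solved using similar optimization techniques. The solutions $\boldsymbol{\alpha}$
and $\boldsymbol{\alpha}'$ define the hypothesis $h$ returned by SVR as follows:


$$\forall x \in \mathcal{X}, \quad h(x) = \sum_{i=1}^{m} (\alpha'_i - \alpha_i) K(x_i, x) + b, \tag{10.27}$$

where the offset $b$ can be obtained from a point $x_j$ with $0 < \alpha_j < C$ by


$$b = -\sum_{i=1}^{m} (\alpha'_i - \alpha_i) K(x_i, x_j) + y_j + \epsilon, \tag{10.28}$$

or from a point $x_j$ with $0 < \alpha'_j < C$ via


$$b = -\sum_{i=1}^{m} (\alpha'_i - \alpha_i) K(x_i, x_j) + y_j - \epsilon. \tag{10.29}$$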


By the complementarity conditions, for all $i \in [1, m]$, the following equalities hold:


$$\alpha_i \big( (\mathbf{w} \cdot \Phi(x_i) + b) - y_i - \epsilon - \xi_i \big) = 0$$
$$\alpha'_i \big( (\mathbf{w} \cdot \Phi(x_i) + b) - y_i + \epsilon + \xi'_i \big) = 0.$$


Thus, if $\alpha_i \ne 0$ or $\alpha'_i \ne 0$, that is if $x_i$ is a support vector, then either
$(\mathbf{w} \cdot \Phi(x_i) + b) - y_i - \epsilon = \xi_i$ holds or $y_i - (\mathbf{w} \cdot \Phi(x_i) + b) - \epsilon = \xi'_i$. This shows that support vectors
are points lying on the boundary of the $\epsilon$-tube or outside it. Of course, at most one of $\alpha_i$ or $\alpha'_i$ is non-zero for
any point $x_i$: the hypothesis either overestimates or underestimates the true label
by more than $\epsilon$. For the points within the $\epsilon$-tube, we have $\alpha_i = \alpha'_i = 0$; thus,
these points do not contribute to the definition of the hypothesis returned by SVR.
Thus, when the number of points inside the tube is relatively large, the hypothesis
returned by SVR is relatively sparse. The choice of the parameter $\epsilon$ determines a
trade-off between sparsity and accuracy: larger $\epsilon$ values provide sparser solutions,
since more points can fall within the $\epsilon$-tube, but may ignore too many key points
for determining an accurate solution.

The following generalization bounds hold for the $\epsilon$-insensitive loss and kernel-based
hypotheses and thus for the SVR algorithm. We denote by $D$ the distribution
according to which sample points are drawn and by $\widehat{D}$ the empirical distribution
defined by a training sample of size $m$.


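Theorem 10.8

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel, let $\Phi\colon \mathcal{X} \to \mathbb{H}$ be a feature mapping associated
to $K$ and let $H = \{x \mapsto \mathbf{w} \cdot \Phi(x) : \|\mathbf{w}\|_{\mathbb{H}} \le \Lambda\}$. Assume that there exists $r > 0$ such
that $K(x, x) \le r^2$ and $|f(x)| \le \Lambda r$ for all $x \in \mathcal{X}$. Fix $\epsilon > 0$. Then, for any $\delta > 0$,
with probability at least $1 - \delta$, each of the following inequalities holds for all $h \in H$:


$$\mathbb{E}_{x \sim D}\big[ |h(x) - f(x)|_\epsilon \big] \le \mathbb{E}_{x \sim \widehat{D}}\big[ |h(x) - f(x)|_\epsilon \big] + \frac{2 r \Lambda}{\sqrt{m}} \left( 1 + \sqrt{\frac{\log\frac{1}{\delta}}{2}} \right)$$

$$\mathbb{E}_{x \sim D}\big[ |h(x) - f(x)|_\epsilon \big] \le \mathbb{E}_{x \sim \widehat{D}}\big[ |h(x) - f(x)|_\epsilon \big] + \frac{2 r \Lambda}{\sqrt{m}} \left( \sqrt{\frac{\operatorname{Tr}[\mathbf{K}]}{m r^2}} + 3 \sqrt{\frac{\log\frac{2}{\delta}}{2}} \right).$$
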
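Proof Let $H_\epsilon = \{x \mapsto |h(x) - f(x)|_\epsilon : h \in H\}$ and let $H' = \{x \mapsto h(x) - f(x) : h \in H\}$.
Note that the function $\Phi_\epsilon\colon x \mapsto |x|_\epsilon$ is 1-Lipschitz. Thus, by Talagrand's lemma
(lemma 4.2), we have $\widehat{\mathfrak{R}}_S(H_\epsilon) \le \widehat{\mathfrak{R}}_S(H')$. By the proof of theorem 10.2, the equality
$\widehat{\mathfrak{R}}_S(H') = \widehat{\mathfrak{R}}_S(H)$ holds, thus $\widehat{\mathfrak{R}}_S(H_\epsilon) \le \widehat{\mathfrak{R}}_S(H)$.

As in the proof of theorem 10.7, for all $x \in \mathcal{X}$ and $h \in H$, we have $|h(x) - f(x)| \le 2\Lambda r$
and $\mathfrak{R}_m(H) \le \sqrt{\frac{r^2 \Lambda^2}{m}}$. By the general Rademacher complexity learning bound
of theorem 3.1, for any $\delta > 0$, with probability at least $1 - \delta$, the following
learning bound holds with $M = 2\Lambda r$:


$$\mathbb{E}_{x \sim D}\big[ |h(x) - f(x)|_\epsilon \big] \le \mathbb{E}_{x \sim \widehat{D}}\big[ |h(x) - f(x)|_\epsilon \big] + 2 \mathfrak{R}_m(H) + M \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$


Using $\mathfrak{R}_m(H) \le \sqrt{\frac{r^2 \Lambda^2}{m}}$ yields the first statement of the theorem. The second
statement is shown in a similar way. ∎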


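These results provide strong theoretical guarantees for the SVR algorithm. Note,
however, that the theorem does not provide guarantees for the expected loss of the
hypotheses in terms of the squared loss. For $0 < \epsilon < 1/4$, the inequality $|x|^2 \le |x|_\epsilon$
holds for all $x$ in $[-\eta'_\epsilon, -\eta_\epsilon] \cup [\eta_\epsilon, \eta'_\epsilon]$ with $\eta_\epsilon = \frac{1 - \sqrt{1 - 4\epsilon}}{2}$ and $\eta'_\epsilon = \frac{1 + \sqrt{1 - 4\epsilon}}{2}$
(the two roots of $t^2 - t + \epsilon = 0$). For
small values of $\epsilon$, $\eta_\epsilon \approx 0$ and $\eta'_\epsilon \approx 1$, thus, if $M = 2 r \Lambda \le 1$, then the squared loss
can be upper bounded by the $\epsilon$-insensitive loss for almost all values of $(h(x) - f(x))$
in $[-1, 1]$ and the theorem can be used to derive a useful generalization bound for
the squared loss.

More generally, if the objective is to achieve a small squared loss, then SVR
can be modified by using the quadratic $\epsilon$-insensitive loss, that is the square of the
$\epsilon$-insensitive loss, which also leads to a convex QP. We will refer to this version
of the algorithm as quadratic SVR. Introducing the Lagrangian and applying the KKT
conditions leads to the following equivalent dual optimization problem for quadratic
SVR in terms of the kernel matrix $\mathbf{K}$:


$$\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}'} \; -\epsilon (\boldsymbol{\alpha}' + \boldsymbol{\alpha})^\top \mathbf{1} + (\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \mathbf{y} - \frac{1}{2} (\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \Big( \mathbf{K} + \frac{1}{C} \mathbf{I} \Big) (\boldsymbol{\alpha}' - \boldsymbol{\alpha}) \tag{10.30}$$

$$\text{subject to: } (\boldsymbol{\alpha} \ge \mathbf{0}) \wedge (\boldsymbol{\alpha}' \ge \mathbf{0}) \wedge ((\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \mathbf{1} = 0).$$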


Any PDS kernel $K$ can be used with quadratic SVR, which extends the algorithm to
non-linear regression solutions. Problem (10.30) is a convex QP similar to the dual
problem of SVMs in the separable case and can be solved using similar optimization
techniques. The solutions $\boldsymbol{\alpha}$ and $\boldsymbol{\alpha}'$ define the hypothesis $h$ returned by SVR as
follows:

$$h(x) = \sum_{i=1}^{m} (\alpha'_i - \alpha_i) K(x_i, x) + b, \tag{10.31}$$

where the offset $b$ can be obtained from a point $x_j$ with $0 < \alpha_j < C$ or $0 < \alpha'_j < C$
exactly as in the case of SVR with (non-quadratic) $\epsilon$-insensitive loss. Note that for
$\epsilon = 0$, the quadratic SVR algorithm coincides with KRR as can be seen from the
dual optimization problem (the additional constraint $(\boldsymbol{\alpha}' - \boldsymbol{\alpha})^\top \mathbf{1} = 0$ appears here
due to use of an offset $b$). The following generalization bound holds for quadratic
SVR. It can be shown in a way that is similar to the proof of theorem 10.8 using
the fact that the quadratic $\epsilon$-insensitive function $x \mapsto |x|_\epsilon^2$ is 2-Lipschitz.


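Theorem 10.9

Let $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel, $\Phi\colon \mathcal{X} \to \mathbb{H}$ a feature mapping associated to
$K$, and $H = \{x \mapsto \mathbf{w} \cdot \Phi(x) : \|\mathbf{w}\|_{\mathbb{H}} \le \Lambda\}$. Assume that there exists $r > 0$ such that
$K(x, x) \le r^2$ and $|f(x)| \le \Lambda r$ for all $x \in \mathcal{X}$. Fix $\epsilon > 0$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, each of the following inequalities holds for all $h \in H$:


$$\mathbb{E}_{x \sim D}\big[ |h(x) - f(x)|_\epsilon^2 \big] \le \mathbb{E}_{x \sim \widehat{D}}\big[ |h(x) - f(x)|_\epsilon^2 \big] + \frac{8 r^2 \Lambda^2}{\sqrt{m}} \left( 1 + \frac{1}{2} \sqrt{\frac{\log\frac{1}{\delta}}{2}} \right)$$

$$\mathbb{E}_{x \sim D}\big[ |h(x) - f(x)|_\epsilon^2 \big] \le \mathbb{E}_{x \sim \widehat{D}}\big[ |h(x) - f(x)|_\epsilon^2 \big] + \frac{8 r^2 \Lambda^2}{\sqrt{m}} \left( \sqrt{\frac{\operatorname{Tr}[\mathbf{K}]}{m r^2}} + \frac{3}{2} \sqrt{\frac{\log\frac{2}{\delta}}{2}} \right).$$


Figure 10.5 Alternative loss functions that can be used in conjunction with SVR:
the $\epsilon$-insensitive loss $x \mapsto \max(0, |x| - \epsilon)$, the quadratic $\epsilon$-insensitive loss
$x \mapsto \max(0, |x| - \epsilon)^2$, and the Huber loss $x \mapsto x^2$ if $|x| \le c$, $2c|x| - c^2$ otherwise.
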


This theorem provides a strong justification for the quadratic SVR algorithm.
Alternative convex loss functions can be used to define regression algorithms, in particular
the Huber loss (see figure 10.5), which penalizes smaller errors quadratically and
larger ones only linearly.
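
The three losses of figure 10.5 are simple to write down explicitly. The following sketch uses the definitions given in the figure; the function names are chosen for the example, and the argument `x` stands for the residual $h(x) - f(x)$.

```python
import numpy as np

def eps_insensitive(x, eps):
    """epsilon-insensitive loss: max(0, |x| - eps)."""
    return np.maximum(0.0, np.abs(x) - eps)

def quadratic_eps_insensitive(x, eps):
    """Quadratic epsilon-insensitive loss: max(0, |x| - eps)^2."""
    return eps_insensitive(x, eps) ** 2

def huber(x, c):
    """Huber loss: x^2 if |x| <= c, 2c|x| - c^2 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c, x ** 2, 2.0 * c * np.abs(x) - c ** 2)

residuals = np.array([-3.0, -0.3, 0.0, 0.4, 2.0])
losses = {
    "eps": eps_insensitive(residuals, eps=0.5),
    "quad_eps": quadratic_eps_insensitive(residuals, eps=0.5),
    "huber": huber(residuals, c=2.0),
}
```

Residuals of magnitude at most $\epsilon$ incur no loss under the first two functions, which is the source of the sparsity of SVR, while the Huber loss is quadratic near zero and grows linearly beyond $c$, with the two pieces matching at $|x| = c$.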

SVR admits several advantages: the algorithm is based on solid theoretical
guarantees, the solution returned is sparse, and it allows a natural use of PDS
kernels, which extend the algorithm to non-linear regression solutions. SVR also
admits favorable stability properties that we discuss in chapter 11. However, one
drawback of the algorithm is that it requires the selection of two parameters, $C$
and $\epsilon$. These can be selected via cross-validation, as in the case of SVMs, but this
requires a relatively larger validation set. Some heuristics are often used to guide
the search for their values: $C$ is searched near the maximum value of the labels in
the absence of an offset ($b = 0$) and for a normalized kernel, and $\epsilon$ is chosen close to
the average difference of the labels. As already discussed, the value of $\epsilon$ determines
the number of support vectors and the sparsity of the solution. Another drawback of
SVR is that, as in the case of SVMs or KRR, it may be computationally expensive
when dealing with large training sets. One effective solution in such cases, as for
KRR, consists of approximating the kernel matrix using low-rank approximations
via the Nyström method or the partial Cholesky decomposition. In the next section,
we discuss an alternative sparse algorithm for regression.


10.3.4 Lasso


Unlike the KRR and SVR algorithms, the Lasso (least absolute shrinkage and
selection operator) algorithm does not admit a natural use of PDS kernels. Thus,
here, we assume that the input space $\mathcal{X}$ is a subset of $\mathbb{R}^N$ and consider a family of
linear hypotheses $H = \{\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x} + b : \mathbf{w} \in \mathbb{R}^N, b \in \mathbb{R}\}$.

Let $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ be a labeled training sample.
Lasso is based on the minimization of the empirical squared error on $S$ with a
regularization term depending on the norm of the weight vector, as in the case of
the ridge regression, but using the $L_1$ norm instead of the $L_2$ norm and without
squaring the norm:

$$\min_{\mathbf{w}, b} F(\mathbf{w}, b) = \lambda \|\mathbf{w}\|_1 + \sum_{i=1}^{m} (\mathbf{w} \cdot \mathbf{x}_i + b - y_i)^2. \tag{10.32}$$

Here $\lambda$ denotes a positive parameter as for ridge regression. This is a convex
optimization problem, since $\|\cdot\|_1$ is convex as with all norms and since the empirical
error term is convex, as already discussed for linear regression. The optimization
for Lasso can be written equivalently as


$$\min_{\mathbf{w}, b} \sum_{i=1}^{m} (\mathbf{w} \cdot \mathbf{x}_i + b - y_i)^2 \quad \text{subject to: } \|\mathbf{w}\|_1 \le \Lambda_1, \tag{10.33}$$

where $\Lambda_1$ is a positive parameter.

The key property of Lasso, as in the case of other algorithms using the $L_1$
norm constraint, is that it leads to a sparse solution $\mathbf{w}$, that is one with few
non-zero components. Figure 10.6 illustrates the difference between the $L_1$ and $L_2$
regularizations in dimension two. The objective function of (10.33) is a quadratic
function, thus its contours are ellipsoids, as illustrated by the figure (in blue). The
areas corresponding to $L_1$ and $L_2$ balls of a fixed radius $\Lambda_1$ are also shown in the
left and right panel (in red). The Lasso solution is the point of intersection of the
contours with the $L_1$ ball. As can be seen from the figure, this can typically occur
at a corner of the $L_1$ ball where some coordinates are zero. In contrast, the ridge
regression solution is at the point of intersection of the contours and the $L_2$ ball,
where none of the coordinates is typically zero.

The following results show that Lasso also benefits from strong theoretical guarantees.
We first give a general upper bound on the empirical Rademacher complexity
of $L_1$ norm-constrained linear hypotheses.


Figure 10.6 Comparison of the Lasso and ridge regression solutions.


Theorem 10.10 Rademacher complexity of linear hypotheses with bounded $L_1$ norm

Let $\mathcal{X} \subseteq \mathbb{R}^N$ and let $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ be a sample of
size $m$. Assume that for all $i \in [1, m]$, $\|\mathbf{x}_i\|_\infty \le r_\infty$ for some $r_\infty > 0$, and let
$H = \{\mathbf{x} \in \mathcal{X} \mapsto \mathbf{w} \cdot \mathbf{x} : \|\mathbf{w}\|_1 \le \Lambda_1\}$. Then, the empirical Rademacher complexity of
$H$ can be bounded as follows:


$$\widehat{\mathfrak{R}}_S(H) \le \sqrt{\frac{2 r_\infty^2 \Lambda_1^2 \log(2N)}{m}}. \tag{10.34}$$
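Proof For any $i \in [1, m]$ we denote by $x_{ij}$ the $j$th component of $\mathbf{x}_i$.

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H) &= \frac{1}{m} \mathbb{E}_{\boldsymbol{\sigma}} \Big[ \sup_{\|\mathbf{w}\|_1 \le \Lambda_1} \sum_{i=1}^{m} \sigma_i \mathbf{w} \cdot \mathbf{x}_i \Big] \\
&= \frac{\Lambda_1}{m} \mathbb{E}_{\boldsymbol{\sigma}} \Big[ \Big\| \sum_{i=1}^{m} \sigma_i \mathbf{x}_i \Big\|_\infty \Big] && \text{(by definition of the dual norm)} \\
&= \frac{\Lambda_1}{m} \mathbb{E}_{\boldsymbol{\sigma}} \Big[ \max_{j \in [1, N]} \Big| \sum_{i=1}^{m} \sigma_i x_{ij} \Big| \Big] && \text{(by definition of } \|\cdot\|_\infty) \\
&= \frac{\Lambda_1}{m} \mathbb{E}_{\boldsymbol{\sigma}} \Big[ \max_{j \in [1, N]} \max_{s \in \{-1, +1\}} s \sum_{i=1}^{m} \sigma_i x_{ij} \Big] && \text{(by definition of the absolute value)} \\
&= \frac{\Lambda_1}{m} \mathbb{E}_{\boldsymbol{\sigma}} \Big[ \sup_{\mathbf{z} \in A} \sum_{i=1}^{m} \sigma_i z_i \Big],
\end{aligned}$$

where $A$ denotes the set of $2N$ vectors $\{s(x_{1j}, \ldots, x_{mj})^\top : j \in [1, N], s \in \{-1, +1\}\}$.
For any $\mathbf{z} \in A$, we have $\|\mathbf{z}\|_2 \le \sqrt{m r_\infty^2} = r_\infty \sqrt{m}$. Thus, by Massart's lemma
(theorem 3.3), since $A$ contains at most $2N$ elements, the following inequality holds:


$$\widehat{\mathfrak{R}}_S(H) \le \Lambda_1 r_\infty \sqrt{m} \, \frac{\sqrt{2 \log(2N)}}{m} = r_\infty \Lambda_1 \sqrt{\frac{2 \log(2N)}{m}},$$


which concludes the proof. ∎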


Note that the dependence of the bound on the dimension $N$ is only logarithmic, which
suggests that using very high-dimensional feature spaces does not significantly affect
generalization.
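
For a very small sample, the bound (10.34) can be compared with the exact value of the empirical Rademacher complexity, using the dual-norm expression $\frac{\Lambda_1}{m} \mathbb{E}_{\boldsymbol{\sigma}}\big[\|\sum_i \sigma_i \mathbf{x}_i\|_\infty\big]$ from the proof. The sketch below enumerates all $2^m$ sign vectors, so it is only meant as an illustration; the function names and the data are assumptions made for the example.

```python
import itertools
import numpy as np

def l1_rademacher_exact(X, Lambda1):
    """Exact empirical Rademacher complexity of {x -> w.x : ||w||_1 <= Lambda1}.

    X: (m, N) array whose rows are the sample points. Enumerates all 2^m
    sign vectors, so this is usable only for very small m.
    """
    m = X.shape[0]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=m):
        # || sum_i sigma_i x_i ||_inf for this sign vector
        total += np.max(np.abs(np.array(signs) @ X))
    return Lambda1 * total / (2 ** m * m)

def l1_rademacher_bound(X, Lambda1):
    """Right-hand side of (10.34), with r_inf taken as the largest |x_ij| in the sample."""
    m, N = X.shape
    r_inf = np.max(np.abs(X))
    return np.sqrt(2.0 * r_inf ** 2 * Lambda1 ** 2 * np.log(2 * N) / m)

X = np.array([[1.0, -2.0, 0.5],
              [0.0, 1.0, 1.0],
              [-1.0, 0.5, 2.0],
              [2.0, 2.0, -1.0]])
exact = l1_rademacher_exact(X, Lambda1=1.0)
bound = l1_rademacher_bound(X, Lambda1=1.0)
```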

Using the Rademacher complexity bound just proven and the general result of 
theorem 10.3, the following generalization bound can be shown to hold for the 
hypothesis set used by Lasso, using the squared loss. 


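Theorem 10.11

Let $\mathcal{X} \subseteq \mathbb{R}^N$ and $H = \{\mathbf{x} \in \mathcal{X} \mapsto \mathbf{w} \cdot \mathbf{x} : \|\mathbf{w}\|_1 \le \Lambda_1\}$. Assume that there exists
$r_\infty > 0$ such that for all $\mathbf{x} \in \mathcal{X}$, $\|\mathbf{x}\|_\infty \le r_\infty$ and $|f(\mathbf{x})| \le \Lambda_1 r_\infty$. Then, for any $\delta > 0$,
with probability at least $1 - \delta$, the following inequality holds for all $h \in H$:


$$R(h) \le \widehat{R}(h) + \frac{8 r_\infty^2 \Lambda_1^2}{\sqrt{m}} \left( \sqrt{2 \log(2N)} + \frac{1}{2} \sqrt{\frac{\log\frac{1}{\delta}}{2}} \right). \tag{10.35}$$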


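Proof For all $\mathbf{x} \in \mathcal{X}$, by Hölder's inequality, we have $|\mathbf{w} \cdot \mathbf{x}| \le \|\mathbf{w}\|_1 \|\mathbf{x}\|_\infty \le \Lambda_1 r_\infty$,
thus, for all $h \in H$, $|h(\mathbf{x}) - f(\mathbf{x})| \le 2 r_\infty \Lambda_1$. Plugging in the inequality of
theorem 10.10 in the bound of theorem 10.3 with $M = 2 r_\infty \Lambda_1$ gives


$$R(h) \le \widehat{R}(h) + 8 r_\infty^2 \Lambda_1^2 \sqrt{\frac{2 \log(2N)}{m}} + (2 r_\infty \Lambda_1)^2 \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$


which can be simplified and written as (10.35). ∎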


As in the case of ridge regression, we observe that the objective function minimized 
by Lasso has the same form as the right-hand side of this generalization bound. 

There exist a variety of different methods for solving the optimization problem of
Lasso, including an efficient algorithm (LARS) for computing the entire regularization
path of solutions, that is, the Lasso solutions for all values of the regularization
parameter $\lambda$, and other on-line solutions that apply more generally to optimization
problems with an $L_1$ norm constraint.

Here, we show that the Lasso problems (10.32) or (10.33) are equivalent to a
quadratic program (QP), and therefore that any QP solver can be used to compute
the solution. Observe that any weight vector $\mathbf{w}$ can be written as $\mathbf{w} = \mathbf{w}^+ - \mathbf{w}^-$,
with $\mathbf{w}^+ \ge 0$, $\mathbf{w}^- \ge 0$, and $w_j^+ = 0$ or $w_j^- = 0$ for any $j \in [1, N]$, which implies
$\|\mathbf{w}\|_1 = \sum_{j=1}^{N} (w_j^+ + w_j^-)$. This can be done by defining the $j$th component of $\mathbf{w}^+$ as
$w_j$ if $w_j \ge 0$, 0 otherwise, and similarly the $j$th component of $\mathbf{w}^-$ as $-w_j$ if $w_j \le 0$,
0 otherwise, for any $j \in [1, N]$. With the replacement $\mathbf{w} = \mathbf{w}^+ - \mathbf{w}^-$, with $\mathbf{w}^+ \ge 0$,
$\mathbf{w}^- \ge 0$, and $\|\mathbf{w}\|_1 = \sum_{j=1}^{N} (w_j^+ + w_j^-)$, the Lasso problem (10.32) becomes


$$\min_{\mathbf{w}^+ \ge 0, \, \mathbf{w}^- \ge 0, \, b} \; \lambda \sum_{j=1}^{N} (w_j^+ + w_j^-) + \sum_{i=1}^{m} \big( (\mathbf{w}^+ - \mathbf{w}^-) \cdot \mathbf{x}_i + b - y_i \big)^2. \tag{10.36}$$


Conversely, a solution $\mathbf{w} = \mathbf{w}^+ - \mathbf{w}^-$ of (10.36) verifies the condition $w_j^+ = 0$ or
$w_j^- = 0$ for any $j \in [1, N]$, thus $w_j = w_j^+$ when $w_j \ge 0$ and $w_j = -w_j^-$ when $w_j \le 0$.
This is because if $\delta_j = \min(w_j^+, w_j^-) > 0$ for some $j \in [1, N]$, replacing $w_j^+$ with
$(w_j^+ - \delta_j)$ and $w_j^-$ with $(w_j^- - \delta_j)$ would not affect $w_j^+ - w_j^- = (w_j^+ - \delta_j) - (w_j^- - \delta_j)$,
but would reduce the term $(w_j^+ + w_j^-)$ in the objective function by $2\delta_j > 0$ and
provide a better solution. In view of this analysis, problems (10.32) and (10.36)
admit the same optimal solution and are equivalent. Problem (10.36) is a QP since
the objective function is quadratic in $\mathbf{w}^+$, $\mathbf{w}^-$, and $b$, and since the constraints are
affine. With this formulation, the problem can be straightforwardly shown to admit
a natural online algorithmic solution (exercise 10.10).²
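
Formulation (10.36) has a smooth objective and only non-negativity constraints, so even a very simple method applies to it. The sketch below uses projected gradient descent rather than a general QP solver; this choice, the function name `lasso_split_projected_gradient`, the fixed step size, and the row-wise layout of `X` (one sample per row) are all assumptions made for the illustration.

```python
import numpy as np

def lasso_split_projected_gradient(X, y, lam, n_iter=2000):
    """Minimize (10.36) over w_plus >= 0, w_minus >= 0 and b by projected gradient.

    X: (m, N) array with one sample per row; y: (m,) targets; lam: the parameter lambda.
    Returns (w, b, w_plus, w_minus) with w = w_plus - w_minus.
    """
    m, N = X.shape
    # Stack the variables as u = (w_plus, w_minus, b), so that A u = X w + b.
    A = np.hstack([X, -X, np.ones((m, 1))])
    c = np.concatenate([lam * np.ones(2 * N), [0.0]])      # linear term of (10.36)
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)          # 1 / Lipschitz constant of the gradient
    u = np.zeros(2 * N + 1)
    for _ in range(n_iter):
        grad = c + 2.0 * A.T @ (A @ u - y)
        u = u - step * grad
        u[:2 * N] = np.maximum(u[:2 * N], 0.0)              # project onto w_plus, w_minus >= 0
    w_plus, w_minus, b = u[:N], u[N:2 * N], u[2 * N]
    return w_plus - w_minus, b, w_plus, w_minus

# Orthonormal, centered design: the Lasso solution is a soft-thresholding of (3, 0.2).
X = 0.5 * np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
y = X @ np.array([3.0, 0.2])
w, b, w_plus, w_minus = lasso_split_projected_gradient(X, y, lam=1.0, n_iter=500)
```

On this example the small coefficient is set exactly to zero while the large one is shrunk by $\lambda/2$, and at most one of $w_j^+$, $w_j^-$ is non-zero for each $j$, as the argument above predicts.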

Thus, Lasso has several advantages: it benefits from strong theoretical guarantees 
and returns a sparse solution, which is advantageous when there are accurate 
solutions based on few features. The sparsity of the solution is also computationally 
attractive; sparse feature representations of the weight vector can be used to make 
the inner product with a new vector more efficient. The algorithm’s sparsity can also 
be used for feature selection. The main drawback of the algorithm is that it does not 
admit a natural use of PDS kernels and thus an extension to non-linear regression, 
unlike KRR and SVR. One solution is then to use empirical kernel maps, as discussed 
in chapter 5. Also, Lasso does not admit a closed-form solution. A closed form is
not a critical property from the optimization point of view, but it is one that can make
some mathematical analyses very convenient.


10.3.5 Group norm regression algorithms 


2. The technique we described to avoid absolute values in the objective function can be
used similarly in other optimization problems.


WIDROWHOFF($\mathbf{w}_0$)
    $\mathbf{w}_1 \leftarrow \mathbf{w}_0$                                      ▷ typically $\mathbf{w}_0 = \mathbf{0}$
    for $t \leftarrow 1$ to $T$ do
        RECEIVE($\mathbf{x}_t$)
        $\widehat{y}_t \leftarrow \mathbf{w}_t \cdot \mathbf{x}_t$
        RECEIVE($y_t$)
        $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - 2\eta (\mathbf{w}_t \cdot \mathbf{x}_t - y_t) \mathbf{x}_t$      ▷ learning rate $\eta > 0$
    return $\mathbf{w}_{T+1}$


Figure 10.7 The Widrow-Hoff algorithm.


Other types of regularization aside from the $L_1$ or $L_2$ norm can be used to define
regression algorithms. For instance, in some situations, the feature space may be
naturally partitioned into subsets, and it may be desirable to find a sparse solution
that selects or omits entire subsets of features. A natural norm in this setting is
the group or mixed norm $L_{2,1}$, which is a combination of the $L_1$ and $L_2$ norms.
Imagine that we partition $\mathbf{w} \in \mathbb{R}^N$ as $\mathbf{w}_1, \ldots, \mathbf{w}_k$, where $\mathbf{w}_j \in \mathbb{R}^{N_j}$ for $1 \le j \le k$
and $\sum_j N_j = N$, and define $\mathbf{W} = (\mathbf{w}_1^\top, \ldots, \mathbf{w}_k^\top)^\top$. Then the $L_{2,1}$ norm of $\mathbf{W}$ is
defined as

$$\|\mathbf{W}\|_{2,1} = \sum_{j=1}^{k} \|\mathbf{w}_j\|.$$


Combining the $L_{2,1}$ norm with the empirical mean squared error leads to the Group
Lasso formulation. More generally, an $L_{q,p}$ group norm regularization can be used
for $q, p \ge 1$ (see appendix A for the definition of group norms).
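
As a small illustration of the $L_{2,1}$ norm, the sketch below computes it for a weight vector partitioned into consecutive groups. The function name `l21_norm` and the encoding of the partition as a list of group sizes $N_1, \ldots, N_k$ are choices made for the example.

```python
import numpy as np

def l21_norm(w, group_sizes):
    """Sum of the Euclidean norms of consecutive blocks of w with the given sizes."""
    w = np.asarray(w, dtype=float)
    if sum(group_sizes) != w.size:
        raise ValueError("group sizes must sum to the dimension of w")
    boundaries = np.cumsum(group_sizes)[:-1]     # split points between groups
    return sum(np.linalg.norm(block) for block in np.split(w, boundaries))

# Three groups of two features each; the middle group is entirely zero.
value = l21_norm([3.0, 4.0, 0.0, 0.0, 5.0, 12.0], [2, 2, 2])
```

A group whose weights are all zero contributes nothing to the norm, which is why this regularization favors solutions that omit entire groups of features.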


10.3.6 On-line regression algorithms 


The regression algorithms presented in the previous sections admit natural on-line
versions. Here, we briefly present two examples of these algorithms. These
algorithms are particularly useful for applications to very large data sets for which
a batch solution can be computationally too costly to derive and more generally in
all of the on-line learning settings discussed in chapter 7.

Our first example is known as the Widrow-Hoff algorithm and coincides with
the application of stochastic gradient descent techniques to the linear regression
objective function. Figure 10.7 gives the pseudocode of the algorithm. A similar
algorithm can be derived by applying the stochastic gradient technique to ridge
regression. At each round, the weight vector is augmented with a quantity that
depends on the prediction error $(\mathbf{w}_t \cdot \mathbf{x}_t - y_t)$.
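
A minimal Python sketch of the Widrow-Hoff update follows. It takes a stochastic gradient descent step on the squared loss $(\mathbf{w} \cdot \mathbf{x}_t - y_t)^2$ at each round, so the correction is subtracted: $\mathbf{w}_{t+1} = \mathbf{w}_t - 2\eta(\mathbf{w}_t \cdot \mathbf{x}_t - y_t)\mathbf{x}_t$. The function name `widrow_hoff` and the in-memory arrays standing in for the stream of received points are assumptions of the example.

```python
import numpy as np

def widrow_hoff(xs, ys, eta, w0=None):
    """One pass of the Widrow-Hoff (stochastic gradient) update over a sequence.

    xs: (T, N) array of points received in order; ys: (T,) array of labels;
    eta: learning rate (> 0); w0: initial weight vector (zero if omitted).
    """
    xs = np.asarray(xs, dtype=float)
    w = np.zeros(xs.shape[1]) if w0 is None else np.array(w0, dtype=float)
    for x_t, y_t in zip(xs, ys):
        y_hat = w @ x_t                              # prediction made before y_t is revealed
        w = w - 2.0 * eta * (y_hat - y_t) * x_t      # gradient step on (w . x_t - y_t)^2
    return w

# Noiseless labels generated by w* = (2, -1); the two basis vectors alternate.
xs = np.tile(np.eye(2), (50, 1))
ys = xs @ np.array([2.0, -1.0])
w = widrow_hoff(xs, ys, eta=0.25)
```

With this learning rate each update halves the error on the coordinate just visited, so the weight vector approaches $(2, -1)$ after a few rounds.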

Our second example is an online version of the SVR algorithm, which is obtained 
by application of stochastic gradient descent to the dual objective function of SVR. 
Figure 10.8 gives the pseudocode of the algorithm for an arbitrary PDS kernel K 
in the absence of any offset (b = 0). Another on-line regression algorithm is given 
ONLINEDUALSVR()
    α ← 0
    α' ← 0
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← Σ_{s=1}^{T} (α'_s − α_s) K(x_s, x_t)
        RECEIVE(y_t)
        α'_{t+1} ← α'_t + min(max(η(y_t − ŷ_t − ε), −α'_t), C − α'_t)
        α_{t+1} ← α_t + min(max(η(ŷ_t − y_t − ε), −α_t), C − α_t)
    return Σ_{t=1}^{T} (α'_t − α_t) K(x_t, ·)


Figure 10.8 An on-line version of dual SVR. 


by exercise 10.10 for Lasso. 


10.4 Chapter notes 


The generalization bounds presented in this chapter are for bounded regression 
problems. When $\{x \mapsto L(h(x), f(x)) : h \in H\}$, the family of losses of the hypotheses,
is not bounded, a single function can take arbitrarily large values with arbitrarily 
small probabilities. This is the main issue for deriving uniform convergence bounds 
for unbounded losses. This problem can be avoided either by assuming the existence 
of an envelope, that is a single non-negative function with a finite expectation lying 
above the absolute value of the loss of every function in the hypothesis set [Dudley, 
1984, Pollard, 1984, Dudley, 1987, Pollard, 1989, Haussler, 1992], or by assuming 
that some moment of the loss functions is bounded [Vapnik, 1998, 2006]. Cortes, 
Mansour, and Mohri [2010a] give two-sided generalization bounds for unbounded 
losses with finite second moments. The one-sided version of their bounds coincides 
with that of Vapnik [1998, 2006] modulo a constant factor, but the proofs given by 
Vapnik in both books seem to be incorrect. 

The Rademacher complexity bounds given for regression in this chapter (theo- 
rem 10.2) are novel. The notion of pseudo-dimension is due to Pollard [1984]. Its 
equivalent definition in terms of VC-dimension is discussed by Vapnik [2000]. The 
notion of fat-shattering was introduced by Kearns and Schapire [1990]. The linear 
regression algorithm is a classical algorithm in statistics that dates back at least to 
the nineteenth century. The ridge regression algorithm is due to Hoerl and Kennard
[1970]. Its kernelized version (KRR) was introduced and discussed by Saunders,
Gammerman, and Vovk [1998]. An extension of KRR to outputs in $\mathbb{R}^p$ with $p > 1$
with possible constraints on the regression is presented and analyzed by Cortes,
Mohri, and Weston [2007c]. The support vector regression (SVR) algorithm is discussed
in Vapnik [2000]. Lasso was introduced by Tibshirani [1996]. The LARS
algorithm for solving its optimization problem was later presented by Efron et al.
[2004]. The Widrow-Hoff on-line algorithm is due to Widrow and Hoff [1988]. The
dual on-line SVR algorithm was first introduced and analyzed by Vijayakumar and 
Wu [1999]. The kernel stability analysis of exercise 9.3 is from Cortes et al. [2010b]. 

For large-scale problems where a straightforward batch optimization of a primal or 
dual objective function is intractable, general iterative stochastic gradient descent 
methods similar to those presented in section 10.3.6, or quasi-Newton methods 
such as the limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm
[Nocedal, 1980] can be practical alternatives.

In addition to the linear regression algorithms presented in this chapter and their 
kernel-based non-linear extensions, there exist many other algorithms for regression, 


including decision trees for regression (see chapter 8), boosting trees for regression, 
and artificial neural networks. 


10.5 Exercises 


10.1 Pseudo-dimension and monotonic functions. 


Assume that $\phi$ is a strictly monotonic function and let $\phi \circ H$ be the family of
functions defined by $\phi \circ H = \{\phi(h(\cdot)) : h \in H\}$, where $H$ is some set of real-valued
functions. Show that $\mathrm{Pdim}(\phi \circ H) = \mathrm{Pdim}(H)$.


10.2 Pseudo-dimension of linear functions. Let H be the set of all linear functions 
in dimension d, i.e. $h(x) = w^\top x$ for some $w \in \mathbb{R}^d$. Show that $\mathrm{Pdim}(H) = d$.


10.3 Linear regression. 


(a) What condition is required on the data $X$ in order to guarantee that $XX^\top$
is invertible?

(b) Assume the problem is under-determined. Then, we can choose a solution
$w$ such that the equality $X^\top w = X^\top (XX^\top)^\dagger X y$ (which can be shown to
equal $X^\dagger X y$) holds. One particular choice that satisfies this equality is
$w^* = (XX^\top)^\dagger X y$. However, this is not the unique solution. As a function of $w^*$,
characterize all choices of $w$ that satisfy $X^\top w = X^\dagger X y$ (Hint: use the fact
that $XX^\dagger X = X$).


10.4 Perturbed kernels. Suppose two different kernel matrices, $K$ and $K'$, are used to
train two kernel ridge regression hypotheses with the same regularization parameter
$\lambda$. In this problem, we will show that the difference in the optimal dual variables,
$\alpha$ and $\alpha'$ respectively, is bounded by a quantity that depends on $\|K' - K\|_2$.


(a) Show $\alpha' - \alpha = -\big((K' + \lambda I)^{-1}(K' - K)(K + \lambda I)^{-1}\big) y$. (Hint: Show that for
any invertible matrices $M$ and $M'$, $M'^{-1} - M^{-1} = -M'^{-1}(M' - M)M^{-1}$.)
(b) Assuming $\forall y \in \mathcal{Y}, |y| \le M$, show that

$$\|\alpha' - \alpha\| \le \frac{\sqrt{m}\, M \|K' - K\|_2}{\lambda^2}.$$


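10.5 Huber loss. Derive the primal and dual optimization problem used to solve the
SVR problem with the Huber loss:

$$L_c(\xi_i) = \begin{cases} \frac{1}{2}\xi_i^2 & \text{if } |\xi_i| \le c \\ c\,|\xi_i| - \frac{1}{2}c^2 & \text{otherwise,} \end{cases}$$

where $\xi_i = w \cdot \Phi(x_i) + b - y_i$.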


10.6 SVR and squared loss. Assuming that $2r\Lambda \le 1$, use theorem 10.8 to derive a
generalization bound for the squared loss. 


10.7 SVR dual formulations. Give a detailed and carefully justified derivation of 
the dual formulations of the SVR algorithm both for the $\epsilon$-insensitive loss and the
quadratic $\epsilon$-insensitive loss.


10.8 Optimal kernel matrix. Suppose in addition to optimizing the dual variables 
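$\alpha \in \mathbb{R}^m$, as in (10.19), we also wish to optimize over the entries of the PDS kernel
matrix $K \in \mathbb{R}^{m \times m}$:

$$\min_{K \succeq 0} \max_{\alpha} \; -\lambda \alpha^\top \alpha - \alpha^\top K \alpha + 2\alpha^\top y, \quad \text{s.t. } \|K\|_2 \le 1.$$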


(a) What is the closed-form solution for the optimal K for the joint optimiza- 
tion? 

(b) Optimizing over the choice of kernel matrix will provide a better value of 
the objective function. Explain, however, why the resulting kernel matrix is not 
useful in practice. 
ONLINELASSO(w_0^+, w_0^-)
    w_1^+ ← w_0^+                                  ▷ w_0^+ ≥ 0
    w_1^- ← w_0^-                                  ▷ w_0^- ≥ 0
    for t ← 1 to T do
        RECEIVE(x_t, y_t)
        for j ← 1 to N do
            w^+_{t+1,j} ← max(0, w^+_{t,j} − η[λ − [y_t − (w_t^+ − w_t^-) · x_t] x_{t,j}])
            w^-_{t+1,j} ← max(0, w^-_{t,j} − η[λ + [y_t − (w_t^+ − w_t^-) · x_t] x_{t,j}])
    return w^+_{T+1} − w^-_{T+1}


Figure 10.9 On-line algorithm for Lasso. 


10.9 Leave-one-out error. In general, the computation of the leave-one-out error 
can be very costly since, for a sample of size m, it requires training the algorithm 
m times. The objective of this problem is to show that, remarkably, in the case 
of kernel ridge regression, the leave-one-out error can be computed efficiently by 
training the algorithm only once. 


Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ denote a training sample of size $m$ and for any
$i \in [1, m]$, let $S_i$ denote the sample of size $m - 1$ obtained from $S$ by removing
$(x_i, y_i)$: $S_i = S - \{(x_i, y_i)\}$. For any sample $T$, let $h_T$ denote a hypothesis obtained
by training on $T$. By definition (see definition 4.1), for the squared loss, the leave-one-out
error with respect to $S$ is defined by

$$\widehat{R}_{\mathrm{LOO}}(\mathrm{KRR}) = \frac{1}{m} \sum_{i=1}^{m} \big(h_{S_i}(x_i) - y_i\big)^2.$$

(a) Let $S_i' = ((x_1, y_1), \ldots, (x_i, h_{S_i}(x_i)), \ldots, (x_m, y_m))$. Show that $h_{S_i} = h_{S_i'}$.
(b) Define $\mathbf{y}_i = \mathbf{y} - y_i e_i + h_{S_i}(x_i) e_i$, that is the vector of labels with the
$i$th component replaced with $h_{S_i}(x_i)$. Prove that for KRR
$h_{S_i}(x_i) = \mathbf{y}_i^\top (K + \lambda I)^{-1} K e_i$.
(c) Prove that the leave-one-out error admits the following simple expression
in terms of $h_S$:

$$\widehat{R}_{\mathrm{LOO}}(\mathrm{KRR}) = \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{h_S(x_i) - y_i}{1 - e_i^\top (K + \lambda I)^{-1} K e_i} \right]^2. \tag{10.37}$$

(d) Suppose that the diagonal entries of matrix $M = (K + \lambda I)^{-1} K$ are all equal
to $\gamma$. How do the empirical error $\widehat{R}$ of the algorithm and the leave-one-out error
$\widehat{R}_{\mathrm{LOO}}$ relate? Is there any value of $\gamma$ for which the two errors coincide?


10.10 On-line Lasso. Use the formulation (10.36) of the optimization problem of 
Lasso and stochastic gradient descent (see section 7.3.1) to show that the problem 
can be solved using the on-line algorithm of figure 10.9. 


10.11 On-line quadratic SVR. Derive an on-line algorithm for the quadratic SVR 
algorithm (provide the full pseudocode). 


11 Algorithmic Stability 


In chapters 2-4 and several subsequent chapters, we presented a variety of generalization
bounds based on different measures of the complexity of the hypothesis set
$H$ used for learning, including the Rademacher complexity, the growth function,
and the VC-dimension. These bounds ignore the specific algorithm used, that is, 
they hold for any algorithm using H as a hypothesis set. 

One may ask if an analysis of the properties of a specific algorithm could lead 
to finer guarantees. Such an algorithm-dependent analysis could have the benefit 
of a more informative guarantee. On the other hand, it could be inapplicable to 
other algorithms using the same hypothesis set. Alternatively, as we shall see in 
this chapter, a more general property of the learning algorithm could be used to 
incorporate algorithm-specific properties while extending the applicability of the 
analysis to other learning algorithms with similar properties. 

This chapter uses the property of algorithmic stability to derive algorithm- 
dependent learning guarantees. We first present a generalization bound for any 
algorithm that is sufficiently stable. Then, we show that the wide class of kernel- 
based regularization algorithms enjoys this property and derive a general upper 
bound on their stability coefficient. Finally, we illustrate the application of these 
results to the analysis of several algorithms both in the regression and classification 
settings, including kernel ridge regression (KRR), SVR, and SVMs. 


11.1 Definitions 


We start by introducing the notation and definitions relevant to our analysis of 
algorithmic stability. We denote by $z$ a labeled example $(x, y) \in \mathcal{X} \times \mathcal{Y}$. The
hypotheses $h$ we consider map $\mathcal{X}$ to a set $\mathcal{Y}'$ sometimes different from $\mathcal{Y}$. In
particular, for classification, we may have $\mathcal{Y} = \{-1, +1\}$ while the hypothesis $h$
learned takes values in $\mathbb{R}$. The loss functions $L$ we consider are therefore defined
over $\mathcal{Y}' \times \mathcal{Y}$, with $\mathcal{Y}' = \mathcal{Y}$ in most cases. For a loss function $L: \mathcal{Y}' \times \mathcal{Y} \to \mathbb{R}_+$, we
denote the loss of a hypothesis $h$ at point $z$ by $L_z(h) = L(h(x), y)$. We denote by
$D$ the distribution according to which samples are drawn and by $H$ the hypothesis
set. The empirical error or loss of $h \in H$ on a sample $S = (z_1, \ldots, z_m)$ and its
generalization error are defined, respectively, by

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L_{z_i}(h) \quad \text{and} \quad R(h) = \mathop{\mathrm{E}}_{z \sim D}[L_z(h)].$$

Given an algorithm $A$, we denote by $h_S$ the hypothesis $h_S \in H$ returned by $A$ when
trained on sample $S$. We will say that the loss function $L$ is bounded by $M > 0$ if
for all $h \in H$ and $z \in \mathcal{X} \times \mathcal{Y}$, $L_z(h) \le M$. For the results presented in this chapter, a
weaker condition suffices, namely that $L_z(h_S) \le M$ for all hypotheses $h_S$ returned
by the algorithm $A$ considered.

We are now able to define the notion of uniform stability, the algorithmic property 
used in the analyses of this chapter. 


Definition 11.1 Uniform stability 

Let $S$ and $S'$ be any two training samples that differ by a single point. Then, a
learning algorithm $A$ is uniformly $\beta$-stable if the hypotheses it returns when trained
on any such samples $S$ and $S'$ satisfy

$$\forall z \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}, \quad |L_z(h_S) - L_z(h_{S'})| \le \beta.$$

The smallest such $\beta$ satisfying this inequality is called the stability coefficient of $A$.


In other words, when $A$ is trained on two similar training sets, the losses incurred by
the corresponding hypotheses returned by $A$ should not differ by more than $\beta$. Note
that a uniformly $\beta$-stable algorithm is often referred to as being $\beta$-stable or even just
stable (for some unspecified $\beta$). In general, the coefficient $\beta$ depends on the sample
size $m$. We will see in section 11.2 that $\beta = o(1/\sqrt{m})$ is necessary for the convergence
of the stability-based learning bounds presented in this chapter. In section 11.3, we
will show that a more favorable condition holds, that is, $\beta = O(1/m)$, for a wide
family of algorithms.


11.2 Stability-based generalization guarantee 


In this section, we show that exponential bounds can be derived for the generaliza- 
tion error of stable learning algorithms. The main result is presented in theorem 11.1. 


Theorem 11.1 
Assume that the loss function $L$ is bounded by $M > 0$. Let $A$ be a $\beta$-stable learning
algorithm and let $S$ be a sample of $m$ points drawn i.i.d. according to distribution $D$.
Then, with probability at least $1 - \delta$ over the sample $S$ drawn, the following holds:

$$R(h_S) \le \widehat{R}(h_S) + \beta + (2m\beta + M)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$


Proof The proof is based on the application of McDiarmid's inequality (theorem D.3)
to the function $\Phi$ defined for all samples $S$ by $\Phi(S) = R(h_S) - \widehat{R}(h_S)$. Let
$S'$ be another sample of size $m$ with points drawn i.i.d. according to $D$ that differs
from $S$ by exactly one point. We denote that point by $z_m$ in $S$, $z_m'$ in $S'$, i.e.,

$$S = (z_1, \ldots, z_{m-1}, z_m) \quad \text{and} \quad S' = (z_1, \ldots, z_{m-1}, z_m').$$

By definition of $\Phi$, the following inequality holds:

$$|\Phi(S') - \Phi(S)| \le |R(h_{S'}) - R(h_S)| + |\widehat{R}(h_{S'}) - \widehat{R}(h_S)|. \tag{11.1}$$

We bound each of these two terms separately. By the $\beta$-stability of $A$, we have

$$|R(h_S) - R(h_{S'})| = \big|\mathop{\mathrm{E}}_z[L_z(h_S)] - \mathop{\mathrm{E}}_z[L_z(h_{S'})]\big| \le \mathop{\mathrm{E}}_z\big[|L_z(h_S) - L_z(h_{S'})|\big] \le \beta.$$


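Using the boundedness of $L$ along with $\beta$-stability of $A$, we also have

$$\begin{aligned}
|\widehat{R}(h_S) - \widehat{R}(h_{S'})| &= \frac{1}{m}\left|\Big(\sum_{i=1}^{m-1} L_{z_i}(h_S) - L_{z_i}(h_{S'})\Big) + L_{z_m}(h_S) - L_{z_m'}(h_{S'})\right| \\
&\le \frac{1}{m}\left[\Big(\sum_{i=1}^{m-1} |L_{z_i}(h_S) - L_{z_i}(h_{S'})|\Big) + |L_{z_m}(h_S) - L_{z_m'}(h_{S'})|\right] \\
&\le \frac{m-1}{m}\beta + \frac{M}{m} \le \beta + \frac{M}{m}.
\end{aligned}$$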


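Thus, in view of (11.1), $\Phi$ satisfies the condition $|\Phi(S) - \Phi(S')| \le 2\beta + \frac{M}{m}$. By
applying McDiarmid's inequality to $\Phi(S)$, we can bound the deviation of $\Phi$ from
its mean as

$$\Pr\Big[\Phi(S) \ge \epsilon + \mathop{\mathrm{E}}_S[\Phi(S)]\Big] \le \exp\left(\frac{-2m\epsilon^2}{(2m\beta + M)^2}\right),$$

or, equivalently, with probability $1 - \delta$,

$$\Phi(S) < \epsilon + \mathop{\mathrm{E}}_S[\Phi(S)], \tag{11.2}$$

where $\delta = \exp\left(\frac{-2m\epsilon^2}{(2m\beta + M)^2}\right)$. If we solve for $\epsilon$ in this expression for $\delta$, plug into (11.2)
and rearrange terms, then, with probability $1 - \delta$, we have

$$\Phi(S) \le \mathop{\mathrm{E}}_{S \sim D^m}[\Phi(S)] + (2m\beta + M)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{11.3}$$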


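We now bound the expectation term, first noting that by linearity of expectation
$\mathrm{E}_S[\Phi(S)] = \mathrm{E}_S[R(h_S)] - \mathrm{E}_S[\widehat{R}(h_S)]$. By definition of the generalization error,

$$\mathop{\mathrm{E}}_{S \sim D^m}[R(h_S)] = \mathop{\mathrm{E}}_{S \sim D^m}\Big[\mathop{\mathrm{E}}_{z \sim D}[L_z(h_S)]\Big] = \mathop{\mathrm{E}}_{S, z \sim D^{m+1}}[L_z(h_S)]. \tag{11.4}$$

By the linearity of expectation,

$$\mathop{\mathrm{E}}_{S \sim D^m}[\widehat{R}(h_S)] = \frac{1}{m}\sum_{i=1}^{m} \mathop{\mathrm{E}}_{S \sim D^m}[L_{z_i}(h_S)] = \mathop{\mathrm{E}}_{S \sim D^m}[L_{z_1}(h_S)], \tag{11.5}$$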


an 


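where the second equality follows from the fact that the $z_i$ are drawn i.i.d. and thus
the expectations $\mathrm{E}_{S \sim D^m}[L_{z_i}(h_S)]$, $i \in [1, m]$, are all equal. The last expression in
(11.5) is the expected loss of a hypothesis on one of its training points. We can
rewrite it as $\mathrm{E}_{S \sim D^m}[L_{z_1}(h_S)] = \mathrm{E}_{S, z \sim D^{m+1}}[L_z(h_{S'})]$, where $S'$ is a sample of $m$
points containing $z$ extracted from the $m + 1$ points formed by $S$ and $z$. Thus, in
view of (11.4) and by the $\beta$-stability of $A$, it follows that

$$\begin{aligned}
\Big|\mathop{\mathrm{E}}_{S \sim D^m}[\Phi(S)]\Big| &= \Big|\mathop{\mathrm{E}}_{S, z \sim D^{m+1}}[L_z(h_S)] - \mathop{\mathrm{E}}_{S, z \sim D^{m+1}}[L_z(h_{S'})]\Big| \\
&\le \mathop{\mathrm{E}}_{S, z \sim D^{m+1}}\big[|L_z(h_S) - L_z(h_{S'})|\big] \\
&\le \mathop{\mathrm{E}}_{S, z \sim D^{m+1}}[\beta] = \beta.
\end{aligned}$$

We can thus replace $\mathrm{E}_S[\Phi(S)]$ by $\beta$ in (11.3), which completes the proof.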


The bound of the theorem converges for $(m\beta)/\sqrt{m} = o(1)$, that is $\beta = o(1/\sqrt{m})$. In
particular, when the stability coefficient $\beta$ is in $O(1/m)$, the theorem guarantees that
$R(h_S) - \widehat{R}(h_S) = O(1/\sqrt{m})$ with high probability. In the next section, we show that
kernel-based regularization algorithms precisely admit this property under some
general assumptions.
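
The effect of the rate of $\beta$ on the bound can be checked numerically. The sketch below evaluates the right-hand side of theorem 11.1 for $\beta = c/m$ and for $\beta = c/\sqrt{m}$. The function name `theorem_11_1_bound` and the constants `M`, `delta`, and `c` are hypothetical values chosen only for illustration.

```python
import math

def theorem_11_1_bound(emp_error, beta, m, M, delta):
    """Right-hand side of theorem 11.1: emp_error + beta + (2 m beta + M) sqrt(log(1/delta) / (2m))."""
    return emp_error + beta + (2 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

# Hypothetical constants, chosen only for illustration.
M, delta, c = 1.0, 0.05, 1.0
for m in (100, 10_000, 1_000_000):
    fast = theorem_11_1_bound(0.0, c / m, m, M, delta)             # beta = c / m
    slow = theorem_11_1_bound(0.0, c / math.sqrt(m), m, M, delta)  # beta = c / sqrt(m)
    print(m, round(fast, 4), round(slow, 4))
```

With $\beta = c/m$ the slack shrinks at the rate $1/\sqrt{m}$, while with $\beta = c/\sqrt{m}$ the term $2m\beta\sqrt{\log(1/\delta)/(2m)}$ stays equal to the constant $2c\sqrt{\log(1/\delta)/2}$, so the bound does not converge.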


11.3 Stability of kernel-based regularization algorithms


Let $K$ be a positive definite symmetric kernel, $\mathbb{H}$ the reproducing kernel Hilbert
space associated to $K$, and $\|\cdot\|_K$ the norm induced by $K$ in $\mathbb{H}$. A kernel-based
regularization algorithm is defined by the minimization over $\mathbb{H}$ of an objective
function $F_S$ based on a training sample $S = (z_1, \ldots, z_m)$ and defined for all $h \in \mathbb{H}$
by:

$$F_S(h) = \widehat{R}_S(h) + \lambda \|h\|_K^2. \tag{11.6}$$

In this equation, $\widehat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} L_{z_i}(h)$ is the empirical error of hypothesis $h$ with
respect to a loss function $L$ and $\lambda > 0$ a trade-off parameter balancing the emphasis
on the empirical error versus the regularization term $\|h\|_K^2$. The hypothesis set $H$
is the subset of $\mathbb{H}$ formed by the hypotheses possibly returned by the algorithm.
Algorithms such as KRR, SVR and SVMs all fall under this general model.

Figure 11.1 Illustration of the quantity measured by the Bregman divergence
defined based on a convex and differentiable function $F$. The divergence measures
the distance between $F(y)$ and the hyperplane tangent to the curve at point $x$.

We first introduce some definitions and tools needed for a general proof of an 
upper bound on the stability coefficient of kernel-based regularization algorithms. 
Our analysis will assume that the loss function $L$ is convex and that it further
verifies the following Lipschitz-like smoothness condition.


Definition 11.2 σ-admissibility
A loss function $L$ is $\sigma$-admissible with respect to the hypothesis class $H$ if there
exists $\sigma \in \mathbb{R}_+$ such that for any two hypotheses $h, h' \in H$ and for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,

$$|L(h'(x), y) - L(h(x), y)| \le \sigma |h'(x) - h(x)|. \tag{11.7}$$

This assumption holds for the quadratic loss and most other loss functions where
the hypothesis set and the set of output labels are bounded by some $M \in \mathbb{R}_+$:
$\forall h \in H, \forall x \in \mathcal{X}, |h(x)| \le M$ and $\forall y \in \mathcal{Y}, |y| \le M$.

We will use the notion of Bregman divergence, $B_F$, which can be defined for any
convex and differentiable function $F: \mathbb{H} \to \mathbb{R}$ as follows: for all $f, g \in \mathbb{H}$,

$$B_F(f \| g) = F(f) - F(g) - \langle f - g, \nabla F(g) \rangle.$$

Figure 11.1 illustrates the geometric interpretation of the Bregman divergence. We
generalize this definition to cover the case of convex but non-differentiable loss
functions $F$ by using the notion of subgradient. For a convex function $F: \mathbb{H} \to \mathbb{R}$,
we denote by $\partial F(h)$ the subgradient of $F$ at $h$, which is defined as follows:

$$\partial F(h) = \{g \in \mathbb{H} : \forall h' \in \mathbb{H}, \; F(h') - F(h) \ge \langle h' - h, g \rangle\}.$$

Figure 11.2 Illustration of the notion of sub-gradient: elements of the subgradient
set $\partial F(h)$ are shown in red at point $h$, for the function $F$ shown in blue.

Thus, $\partial F(h)$ is the set of vectors $g$ defining a hyperplane supporting function $F$ at
point $h$ (see figure 11.2). $\partial F(h)$ coincides with $\nabla F(h)$ when $F$ is differentiable at $h$,
i.e. $\partial F(h) = \{\nabla F(h)\}$. Note that at a point $h$ where $F$ is minimal, 0 is an element
of $\partial F(h)$. Furthermore, the subgradient is additive, that is, for two convex functions
$F_1$ and $F_2$, $\partial(F_1 + F_2)(h) = \{g_1 + g_2 : g_1 \in \partial F_1(h), g_2 \in \partial F_2(h)\}$. For any $h \in \mathbb{H}$,
we fix $\delta F(h)$ to be an (arbitrary) element of $\partial F(h)$. For any such choice of $\delta F$, we
can define the generalized Bregman divergence associated to $F$ by:

$$\forall h', h \in \mathbb{H}, \quad B_F(h' \| h) = F(h') - F(h) - \langle h' - h, \delta F(h) \rangle. \tag{11.8}$$

Note that by definition of the subgradient, $B_F(h' \| h) \ge 0$ for all $h', h \in \mathbb{H}$.
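
To make the definition concrete, the following sketch evaluates the Bregman divergence for the differentiable function $N(h) = \|h\|^2$, using ordinary Euclidean vectors as a stand-in for elements of the Hilbert space (an assumption made only for this illustration; the helper names are hypothetical). For this choice of function, $B_N(h' \| h) = \|h' - h\|^2$, so the symmetrized divergence equals $2\|h' - h\|^2$.

```python
def bregman(F, grad_F, f, g):
    """B_F(f || g) = F(f) - F(g) - <f - g, grad F(g)> for vectors given as lists."""
    inner = sum((fi - gi) * di for fi, gi, di in zip(f, g, grad_F(g)))
    return F(f) - F(g) - inner

def sq_norm(h):
    """N(h) = ||h||^2."""
    return sum(v * v for v in h)

def grad_sq_norm(h):
    """The gradient of N at h is 2h."""
    return [2.0 * v for v in h]

h = [1.0, -2.0, 0.5]
h_prime = [0.0, 1.0, 2.0]
sym = bregman(sq_norm, grad_sq_norm, h_prime, h) + bregman(sq_norm, grad_sq_norm, h, h_prime)
dist2 = sum((a - b) ** 2 for a, b in zip(h_prime, h))
print(sym, 2 * dist2)  # the two values agree
```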

Starting from (11.6), we can now define the generalized Bregman divergence
of $F_S$. Let $N$ denote the convex function $h \mapsto \|h\|_K^2$. Since $N$ is differentiable,
$\delta N(h) = \nabla N(h)$ for all $h \in \mathbb{H}$, and $\delta N$ and thus $B_N$ is uniquely defined. To
make the definition of the Bregman divergences for $F_S$ and $\widehat{R}_S$ compatible so that
$B_{F_S} = B_{\widehat{R}_S} + \lambda B_N$, we define $\delta \widehat{R}_S$ in terms of $\delta F_S$ by: $\delta \widehat{R}_S(h) = \delta F_S(h) - \lambda \nabla N(h)$
for all $h \in \mathbb{H}$. Furthermore, we choose $\delta F_S(h)$ to be 0 for any point $h$ where $F_S$ is
minimal and let $\delta F_S(h)$ be an arbitrary element of $\partial F_S(h)$ for all other $h \in \mathbb{H}$. We
proceed in a similar way to define the Bregman divergences for $F_{S'}$ and $\widehat{R}_{S'}$ so that
$B_{F_{S'}} = B_{\widehat{R}_{S'}} + \lambda B_N$.

We will use the notion of generalized Bregman divergence for the proof of the fol- 
lowing general upper bound on the stability coefficient of kernel-based regularization 
algorithms. 
Proposition 11.1

Let $K$ be a positive definite symmetric kernel such that for all $x \in \mathcal{X}$, $K(x, x) \le r^2$
for some $r \in \mathbb{R}_+$ and let $L$ be a convex and $\sigma$-admissible loss function. Then, the
kernel-based regularization algorithm defined by the minimization (11.6) is $\beta$-stable
with the following upper bound on $\beta$:

$$\beta \le \frac{\sigma^2 r^2}{m\lambda}.$$

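Proof Let $h$ be a minimizer of $F_S$ and $h'$ a minimizer of $F_{S'}$, where samples $S$ and
$S'$ differ exactly by one point, $z_m$ in $S$ and $z_m'$ in $S'$. Since the generalized Bregman
divergence is non-negative and since $B_{F_S} = B_{\widehat{R}_S} + \lambda B_N$ and $B_{F_{S'}} = B_{\widehat{R}_{S'}} + \lambda B_N$,
we can write

$$B_{F_S}(h' \| h) + B_{F_{S'}}(h \| h') \ge \lambda\big(B_N(h' \| h) + B_N(h \| h')\big).$$

Observe that $B_N(h' \| h) + B_N(h \| h') = -\langle h' - h, 2h \rangle - \langle h - h', 2h' \rangle = 2\|h' - h\|_K^2$.
Let $\Delta h$ denote $h' - h$, then we can write

$$\begin{aligned}
2\lambda \|\Delta h\|_K^2 &\le B_{F_S}(h' \| h) + B_{F_{S'}}(h \| h') \\
&= F_S(h') - F_S(h) - \langle h' - h, \delta F_S(h) \rangle + F_{S'}(h) - F_{S'}(h') - \langle h - h', \delta F_{S'}(h') \rangle \\
&= F_S(h') - F_S(h) + F_{S'}(h) - F_{S'}(h') \\
&= \widehat{R}_S(h') - \widehat{R}_S(h) + \widehat{R}_{S'}(h) - \widehat{R}_{S'}(h').
\end{aligned}$$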


The second equality follows from the definition of $h'$ and $h$ as minimizers and our
choice of the subgradients for minimal points which together imply $\delta F_{S'}(h') = 0$
and $\delta F_S(h) = 0$. The last equality follows from the definitions of $F_S$ and $F_{S'}$. Next,
we express the resulting inequality in terms of the loss function $L$ and use the fact
that $S$ and $S'$ differ by only one point along with the $\sigma$-admissibility of $L$ to get

$$\begin{aligned}
2\lambda \|\Delta h\|_K^2 &\le \frac{1}{m}\big[L_{z_m}(h') - L_{z_m}(h) + L_{z_m'}(h) - L_{z_m'}(h')\big] \\
&\le \frac{\sigma}{m}\big[|\Delta h(x_m)| + |\Delta h(x_m')|\big].
\end{aligned} \tag{11.9}$$

By the reproducing kernel property and the Cauchy-Schwarz inequality, for all
$x \in \mathcal{X}$,

$$\Delta h(x) = \langle \Delta h, K(x, \cdot) \rangle \le \|\Delta h\|_K \|K(x, \cdot)\|_K = \sqrt{K(x, x)}\, \|\Delta h\|_K \le r \|\Delta h\|_K.$$

In view of (11.9), this implies $\|\Delta h\|_K \le \frac{\sigma r}{\lambda m}$. By the $\sigma$-admissibility of $L$ and the
reproducing property, the following holds:

$$\forall z \in \mathcal{X} \times \mathcal{Y}, \quad |L_z(h') - L_z(h)| \le \sigma |\Delta h(x)| \le r\sigma \|\Delta h\|_K,$$

which gives

$$\forall z \in \mathcal{X} \times \mathcal{Y}, \quad |L_z(h') - L_z(h)| \le \frac{\sigma^2 r^2}{m\lambda},$$

and concludes the proof.


Thus, under the assumptions of the proposition, for a fixed $\lambda$, the stability coefficient
of kernel-based regularization algorithms is in $O(1/m)$.


11.3.1 Application to regression algorithms: SVR and KRR 


Here, we analyze more specifically two widely used regression algorithms, Support 
Vector Regression (SVR) and Kernel Ridge Regression (KRR), which are both 
special instances of the family of kernel-based regularization algorithms. 

SVR is based on the $\epsilon$-insensitive loss $L_\epsilon$ defined for all $(y, y') \in \mathcal{Y} \times \mathcal{Y}$ by:

$$L_\epsilon(y', y) = \begin{cases} 0 & \text{if } |y' - y| \le \epsilon; \\ |y' - y| - \epsilon & \text{otherwise.} \end{cases} \tag{11.10}$$

We now present a stability-based bound for SVR assuming that $L_\epsilon$ is bounded for
the hypotheses returned by SVR (which, as we shall later see in lemma 11.1, is
indeed the case when the label set $\mathcal{Y}$ is bounded).


Corollary 11.1 Stability-based learning bound for SVR 

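Assume that $K(x, x) \le r^2$ for all $x \in \mathcal{X}$ for some $r > 0$ and that $L_\epsilon$ is bounded
by $M > 0$. Let $h_S$ denote the hypothesis returned by SVR when trained on an
i.i.d. sample $S$ of size $m$. Then, for any $\delta > 0$, the following inequality holds with
probability at least $1 - \delta$:

$$R(h_S) \le \widehat{R}(h_S) + \frac{r^2}{m\lambda} + \Big(\frac{2r^2}{\lambda} + M\Big)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$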

Proof We first show that $L_\epsilon(\cdot) = L_\epsilon(\cdot, y)$ is 1-Lipschitz for any $y \in \mathcal{Y}$. For any
$y', y'' \in \mathcal{Y}$, we must consider four cases. First, if $|y' - y| \le \epsilon$ and $|y'' - y| \le \epsilon$,
then $|L_\epsilon(y'') - L_\epsilon(y')| = 0$. Second, if $|y' - y| > \epsilon$ and $|y'' - y| > \epsilon$, then
$|L_\epsilon(y'') - L_\epsilon(y')| = \big||y'' - y| - |y' - y|\big| \le |y'' - y'|$, by the triangle inequality.
Third, if $|y' - y| \le \epsilon$ and $|y'' - y| > \epsilon$, then $|L_\epsilon(y'') - L_\epsilon(y')| = \big||y'' - y| - \epsilon\big| = |y'' - y| - \epsilon \le |y'' - y| - |y' - y| \le |y'' - y'|$. Fourth, if $|y'' - y| \le \epsilon$ and $|y' - y| > \epsilon$,
by symmetry the same inequality is obtained as in the previous case.

Thus, in all cases, $|L_\epsilon(y'', y) - L_\epsilon(y', y)| \le |y'' - y'|$. This implies in particular that
$L_\epsilon$ is $\sigma$-admissible with $\sigma = 1$ for any hypothesis set $H$. By proposition 11.1, under
the assumptions made, SVR is $\beta$-stable with $\beta \le \frac{r^2}{m\lambda}$. Plugging this expression into
the bound of theorem 11.1 yields the result. □
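
The 1-Lipschitz property can also be probed numerically. The sketch below samples random triples $(y, y', y'')$ and records the largest ratio $|L_\epsilon(y'', y) - L_\epsilon(y', y)| / |y'' - y'|$ observed. The sampling range, the value `eps = 0.3`, and the function name are arbitrary choices made for this illustration; such a check is only a sanity test and does not replace the case analysis of the proof.

```python
import random

def eps_insensitive(y_pred, y, eps):
    """Epsilon-insensitive loss: max(0, |y_pred - y| - eps)."""
    return max(0.0, abs(y_pred - y) - eps)

rng = random.Random(0)
eps = 0.3
worst = 0.0
for _ in range(10_000):
    y, a, b = (rng.uniform(-2, 2) for _ in range(3))
    if abs(a - b) > 1e-6:  # skip near-identical pairs to avoid amplifying rounding errors
        ratio = abs(eps_insensitive(a, y, eps) - eps_insensitive(b, y, eps)) / abs(a - b)
        worst = max(worst, ratio)
print(worst)  # stays at or below 1, up to floating-point rounding
```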
We next present a stability-based bound for KRR, which is based on the square
loss $L_2$ defined for all $y', y \in \mathcal{Y}$ by:

$$L_2(y', y) = (y' - y)^2. \tag{11.11}$$

As in the SVR setting, we assume in our analysis that $L_2$ is bounded for the
hypotheses returned by KRR (which, as we shall later see again in lemma 11.1,
is indeed the case when the label set $\mathcal{Y}$ is bounded).


Corollary 11.2 Stability-based learning bound for KRR 

Assume that $K(x, x) \le r^2$ for all $x \in \mathcal{X}$ for some $r > 0$ and that $L_2$ is bounded
by $M > 0$. Let $h_S$ denote the hypothesis returned by KRR when trained on an
i.i.d. sample $S$ of size $m$. Then, for any $\delta > 0$, the following inequality holds with
probability at least $1 - \delta$:

$$R(h_S) \le \widehat{R}(h_S) + \frac{4Mr^2}{\lambda m} + \Big(\frac{8Mr^2}{\lambda} + M\Big)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$


Proof For any $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $h, h' \in H$,

$$\begin{aligned}
|L_2(h'(x), y) - L_2(h(x), y)| &= \big|(h'(x) - y)^2 - (h(x) - y)^2\big| \\
&= \big|[h'(x) - h(x)][(h'(x) - y) + (h(x) - y)]\big| \\
&\le \big(|h'(x) - y| + |h(x) - y|\big)\,|h(x) - h'(x)| \\
&\le 2\sqrt{M}\,|h(x) - h'(x)|,
\end{aligned}$$

where we used the $M$-boundedness of the loss. Thus, $L_2$ is $\sigma$-admissible with
$\sigma = 2\sqrt{M}$. Therefore, by proposition 11.1, KRR is $\beta$-stable with $\beta \le \frac{4r^2 M}{m\lambda}$. Plugging
this expression into the bound of theorem 11.1 yields the result. □


The previous two corollaries assumed bounded loss functions. We now present a 
lemma that implies in particular that the loss functions used by SVR and KRR are 
bounded when the label set is bounded. 


Lemma 11.1 

Assume that $K(x, x) \le r^2$ for all $x \in \mathcal{X}$ for some $r > 0$ and that for all $y \in \mathcal{Y}$,
$L(0, y) \le B$ for some $B > 0$. Then, the hypothesis $h_S$ returned by a kernel-based
regularization algorithm trained on a sample $S$ is bounded as follows:

$$\forall x \in \mathcal{X}, \quad |h_S(x)| \le r\sqrt{B/\lambda}.$$


Proof By the reproducing kernel property and the Cauchy-Schwarz inequality,
we can write

$$\forall x \in \mathcal{X}, \quad |h_S(x)| = |\langle h_S, K(x, \cdot) \rangle| \le \|h_S\|_K \sqrt{K(x, x)} \le r \|h_S\|_K. \tag{11.12}$$

The minimization (11.6) is over $\mathbb{H}$, which includes 0. Thus, by definition of $F_S$ and
$h_S$, the following inequality holds:

$$F_S(h_S) \le F_S(0) = \frac{1}{m}\sum_{i=1}^{m} L(0, y_i) \le B.$$

Since the loss $L$ is non-negative, we have $\lambda \|h_S\|_K^2 \le F_S(h_S)$ and thus $\lambda \|h_S\|_K^2 \le B$.
Combining this inequality with (11.12) yields the result. □
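
The following sketch compares an empirically observed loss change with the stability bound of proposition 11.1 in a deliberately simple setting: ridge regression on scalar inputs, that is, KRR with the linear kernel $K(x, x') = xx'$, for which the minimizer of (11.6) has a closed form. Inputs and labels are assumed to lie in $[-1, 1]$, so $r = 1$ and $L_2(0, y) \le 1$. Lemma 11.1 is used to bound the loss and hence to obtain $\sigma$. All function names and constants are hypothetical and chosen for illustration.

```python
import math
import random

def ridge_1d(sample, lam):
    """Minimizer of (1/m) * sum((w*x - y)^2) + lam * w^2 over scalar weights w."""
    m = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / (sxx + m * lam)

def stability_gap(sample, replacement, lam, grid=41):
    """Largest square-loss change over a grid of test points after swapping the last training point."""
    w = ridge_1d(sample, lam)
    w_prime = ridge_1d(sample[:-1] + [replacement], lam)
    pts = [-1.0 + 2.0 * k / (grid - 1) for k in range(grid)]
    return max(abs((w * x - y) ** 2 - (w_prime * x - y) ** 2)
               for x in pts for y in pts)

def krr_stability_bound(m, lam, r=1.0, label_bound=1.0):
    """sigma^2 r^2 / (m lam) with sigma = 2 sqrt(M), where M bounds the square loss via lemma 11.1."""
    big_m = (r * label_bound / math.sqrt(lam) + label_bound) ** 2
    sigma = 2.0 * math.sqrt(big_m)
    return sigma ** 2 * r ** 2 / (m * lam)

rng = random.Random(0)
m, lam = 50, 0.1
S = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(m)]
gap = stability_gap(S, (1.0, -1.0), lam)
bound = krr_stability_bound(m, lam)
print(gap, bound)  # the observed gap sits below the worst-case bound
```

The bound is a worst-case guarantee over all samples and all replaced points, so a large difference between the two printed values is expected on any particular sample.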


11.3.2 Application to classification algorithms: SVMs 


This section presents a generalization bound for SVMs, when using the standard
hinge loss defined for all $y \in \mathcal{Y} = \{-1, +1\}$ and $y' \in \mathbb{R}$ by

$$L_{\mathrm{hinge}}(y', y) = \begin{cases} 0 & \text{if } 1 - yy' \le 0; \\ 1 - yy' & \text{otherwise.} \end{cases} \tag{11.13}$$

Corollary 11.3 Stability-based learning bound for SVMs
Assume that $K(x, x) \le r^2$ for all $x \in \mathcal{X}$ for some $r > 0$. Let $h_S$ denote the hypothesis
returned by SVMs when trained on an i.i.d. sample $S$ of size $m$. Then, for any $\delta > 0$,
the following inequality holds with probability at least $1 - \delta$:

$$R(h_S) \le \widehat{R}(h_S) + \frac{r^2}{m\lambda} + \Big(\frac{2r^2}{\lambda} + \frac{r}{\sqrt{\lambda}} + 1\Big)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Proof It is straightforward to verify that $L_{\mathrm{hinge}}(\cdot, y)$ is 1-Lipschitz for any $y \in \mathcal{Y}$
and therefore that it is $\sigma$-admissible with $\sigma = 1$. Therefore, by proposition 11.1,
SVMs is $\beta$-stable with $\beta \le \frac{r^2}{m\lambda}$. Since $|L_{\mathrm{hinge}}(0, y)| \le 1$ for any $y \in \mathcal{Y}$, by lemma 11.1,
$\forall x \in \mathcal{X}, |h_S(x)| \le r/\sqrt{\lambda}$. Thus, for any sample $S$ and any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, the
loss is bounded as follows: $L_{\mathrm{hinge}}(h_S(x), y) \le r/\sqrt{\lambda} + 1$. Plugging this value of $M$
and the one found for $\beta$ into the bound of theorem 11.1 yields the result. □


Since the hinge loss upper bounds the binary loss, the bound of corollary 11.3
also applies to the generalization error of hg measured in terms of the standard 
binary loss used in classification. 


11.3.3 Discussion


Note that the learning bounds presented for kernel-based regularization algorithms
are of the form $R(h_S) - \widehat{R}(h_S) \le O\big(\frac{1}{\lambda\sqrt{m}}\big)$. Thus, these bounds are informative
only when $\lambda \gg 1/\sqrt{m}$. The regularization parameter $\lambda$ is a function of the sample
size $m$: for larger values of $m$, it is expected to be smaller, decreasing the emphasis
on regularization. The magnitude of $\lambda$ affects the norm of the linear hypotheses
used for prediction, with a larger value of $\lambda$ implying a smaller hypothesis norm. In
this sense, $\lambda$ is a measure of the complexity of the hypothesis set and the condition
required for $\lambda$ can be interpreted as stating that a less complex hypothesis set
guarantees better generalization.
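
The dependence on $\lambda$ can be made concrete by evaluating the right-hand side of corollary 11.3 for several values of $\lambda$ at a fixed sample size. The function name `svm_stability_bound` and the choices $r = 1$, $\delta = 0.05$, and $m = 10^6$ are assumptions made only for this illustration.

```python
import math

def svm_stability_bound(emp_error, m, lam, r=1.0, delta=0.05):
    """Right-hand side of corollary 11.3 for SVMs with the hinge loss."""
    beta = r ** 2 / (m * lam)               # stability coefficient bound from proposition 11.1
    big_m = r / math.sqrt(lam) + 1.0        # bound on the hinge loss from lemma 11.1
    return emp_error + beta + (2 * r ** 2 / lam + big_m) * math.sqrt(math.log(1 / delta) / (2 * m))

m = 1_000_000
for lam in (1.0, 0.1, 1 / math.sqrt(m), 1 / m):
    print(f"lambda={lam:.6f}  slack={svm_stability_bound(0.0, m, lam):.4f}")
```

For $\lambda$ of order 1 the slack is small, while for $\lambda$ of order $1/\sqrt{m}$ or smaller it exceeds 1 and the bound carries no information about the binary loss.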

Note also that our analysis of stability in this chapter assumed a fixed $\lambda$: the
regularization parameter is assumed to be invariant to the change of one point of 
the training sample. While this is a mild assumption, it may not hold in general. 


11.4 Chapter notes 


The notion of algorithmic stability was first used by Devroye, Rogers and Wagner 
[Rogers and Wagner, 1978, Devroye and Wagner, 1979a,b] for the k-nearest neighbor 
algorithm and other k-local rules. Kearns and Ron [1999] later gave a formal defini- 
tion of stability and used it to provide an analysis of the leave-one-out error. Much 
of the material presented in this chapter is based on Bousquet and Elisseeff [2002]. 
Our proof of proposition 11.1 is novel and generalizes the results of Bousquet and 
Elisseeff [2002] to the case of non-differentiable convex losses. Moreover, stability- 
based generalization bounds have been extended to ranking algorithms [Agarwal 
and Niyogi, 2005, Cortes et al., 2007b], as well as to the non-i.i.d. scenario of stationary
$\varphi$- and $\beta$-mixing processes [Mohri and Rostamizadeh, 2010], and to the
transductive setting [Cortes et al., 2008a]. Additionally, exercise 11.5 is based on 
Cortes et al. [2010b], which introduces and analyzes stability with respect to the 
choice of the kernel function or kernel matrix. 

Note that while, as shown in this chapter, uniform stability is sufficient for 
deriving generalization bounds, it is not a necessary condition. Some algorithms may 
generalize well in the supervised learning scenario but may not be uniformly stable, 
for example, the Lasso algorithm [Xu et al., 2008]. Shalev-Shwartz et al. [2009] 
have used the notion of stability to provide necessary and sufficient conditions for a 
technical condition of learnability related to PAC-learning, even in general scenarios 
where learning is possible only by using non-ERM rules. 


11.5 Exercises 
11.1 Tighter stability bounds 
(a) Assuming the conditions of theorem 11.1 hold, can one hope to guarantee
a generalization with slack better than $O(1/\sqrt{m})$ even if the algorithm is very
stable, i.e. $\beta \to 0$?
(b) Can you show an $O(1/m)$ generalization guarantee if $L$ is bounded by
$C/\sqrt{m}$ (a very strong condition)? If so, how stable does the learning algorithm
need to be?


11.2 Quadratic hinge loss stability. Let $L$ denote the quadratic hinge loss function
defined for all $y \in \{+1, -1\}$ and $y' \in \mathbb{R}$ by

$$L(y', y) = \begin{cases} 0 & \text{if } 1 - y'y \le 0; \\ (1 - y'y)^2 & \text{otherwise.} \end{cases}$$

Assume that $L(h(x), y)$ is bounded by $M$, $1 \le M < \infty$, for all $h \in H$, $x \in \mathcal{X}$, and
$y \in \{+1, -1\}$, which also implies a bound on $|h(x)|$ for all $h \in H$ and $x \in \mathcal{X}$. Derive
a stability-based generalization bound for SVMs with the quadratic hinge loss.


11.3 Stability of linear regression. 


(a) How does the stability bound in corollary 11.2 for ridge regression (i.e.
kernel ridge regression with a linear kernel) behave as $\lambda \to 0$?


(b) Can you show a stability bound for linear regression (i.e. ridge regression
with $\lambda = 0$)? If not, show a counter-example.


11.4 Kernel stability. Suppose an approximation of the kernel matrix $K$, denoted
$K'$, is used to train the hypothesis $h'$ (and let $h$ denote the non-approximate hypothesis).
At test time, no approximation is made, so if we let $k_x = [K(x, x_1), \ldots, K(x, x_m)]^\top$
we can write $h(x) = \alpha^\top k_x$ and $h'(x) = \alpha'^\top k_x$. Show that if $\forall x, x' \in \mathcal{X}, K(x, x') \le r$
then

$$|h'(x) - h(x)| \le \frac{r m M}{\lambda^2} \|K' - K\|_2.$$

(Hint: Use exercise 9.3)
11.5 Stability of relative-entropy regularization. 


(a) Consider an algorithm that selects a distribution $g$ over a hypothesis class
which is parameterized by $\theta \in \Theta$. Given a point $z = (x, y)$ the expected loss is
defined as

$$H(g, z) = \int_\Theta L(h_\theta(x), y)\, g(\theta)\, d\theta,$$

with respect to a base loss function $L$. Assuming the loss function $L$ is
bounded by $M$, show that the expected loss $H$ is $M$-admissible, i.e. show
$|H(g, z) - H(g', z)| \le M \int_\Theta |g(\theta) - g'(\theta)|\, d\theta$.
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(b) Consider an algorithm that minimizes the entropy regularized objective 
over the choice of distribution g: 


1 m 
F's(9) = A S> A(g, %) +AK(g, fo) - 
i=1 
Rs(g) 


Here, K is the Kullback-Leibler divergence (or relative entropy) between two 
distributions, 


K(o.f0) =f. 9(0) og aa dO, (11.14) 


and fo is some fixed distribution. Show that such an algorithm is stable by 
performing the following steps: 


i. First use the fact (fg |g(@) — 9'(@)| d0)? < K(g,g') (Pinsker’s inequal- 
ity), to show 


2 
( [, as(0) — 959(8)| 48)” < Bae sala!) + Bac. la) 


ii. Next, let g be the minimizer of F's and g’ the minimizer of F's, where 
S and S’ differ only at the index m. Show that 


Bxv.,fo(9llg’) + Bxc.,fo)(9'Il9) 
1 
< — Ha: 2m) — H(g,2m) + H(g, 2m) — H(9', i” 


< oy fla) 9)| do. 


iii. Finally, combine the results above to show that the entropy regularized 
algorithm is ou -stable. 


12 Dimensionality Reduction 


In settings where the data has a large number of features, it is often desirable 
to reduce its dimension, or to find a lower-dimensional representation preserving 
some of its properties. The key arguments for dimensionality reduction (or manifold 
learning) techniques are: 

■ Computational: to compress the initial data as a preprocessing step to speed up 
subsequent operations on the data. 

■ Visualization: to visualize the data for exploratory analysis by mapping the input 
data into two- or three-dimensional spaces. 

■ Feature extraction: to hopefully generate a smaller and more effective or useful 
set of features. 


The benefits of dimensionality reduction are often illustrated via simulated data, 
such as the Swiss roll dataset. In this example, the input data, depicted in fig- 
ure 12.1a, is three-dimensional, but it lies on a two-dimensional manifold that 
is “unfolded” in two-dimensional space as shown in figure 12.1b. It is important 
to note, however, that exact low-dimensional manifolds are rarely encountered in 
practice. Hence, this idealized example is more useful to illustrate the concept of 
dimensionality reduction than to verify the effectiveness of dimensionality reduction 
algorithms. 

Dimensionality reduction can be formalized as follows. Consider a sample 
$S = (x_1, \ldots, x_m)$, a feature mapping $\Phi\colon \mathcal{X} \to \mathbb{R}^N$ and the data matrix $\mathbf{X} \in \mathbb{R}^{N \times m}$ 
defined as $(\Phi(x_1), \ldots, \Phi(x_m))$. The ith data point is represented by $\mathbf{x}_i = \Phi(x_i)$, or 
the ith column of $\mathbf{X}$, which is an N-dimensional vector. Dimensionality reduction 
techniques broadly aim to find, for $k < N$, a k-dimensional representation of the 
data, $\mathbf{Y} \in \mathbb{R}^{k \times m}$, that is in some way faithful to the original representation $\mathbf{X}$. 

In this chapter we will discuss various techniques that address this problem. 
We first present the most commonly used dimensionality reduction technique called 
principal component analysis (PCA). We then introduce a kernelized version of PCA 
(KPCA) and show the connection between KPCA and manifold learning algorithms. 
We conclude with a presentation of the Johnson-Lindenstrauss lemma, a classical 
theoretical result that has inspired a variety of dimensionality reduction methods 
based on the concept of random projections. The discussion in this chapter relies 
on basic matrix properties that are reviewed in appendix A. 


Figure 12.1 The “Swiss roll” dataset. (a) high-dimensional representation. (b) 
lower-dimensional representation. 


12.1 Principal Component Analysis 


Fix $k \in [1, N]$ and let $\mathbf{X}$ be a mean-centered data matrix, that is, $\sum_{i=1}^m \mathbf{x}_i = \mathbf{0}$. 
Define $\mathcal{P}_k$ as the set of N-dimensional rank-k orthogonal projection matrices. 
PCA consists of projecting the N-dimensional input data onto the k-dimensional 
linear subspace that minimizes reconstruction error, that is the sum of the squared 
$L_2$-distances between the original data and the projected data. Thus, the PCA 
algorithm is completely defined by the orthogonal projection matrix solution $\mathbf{P}^*$ of 
the following minimization problem: 

$$\min_{\mathbf{P} \in \mathcal{P}_k} \|\mathbf{P}\mathbf{X} - \mathbf{X}\|_F^2. \tag{12.1}$$


The following theorem shows that PCA coincides with the projection of each 
data point onto the k top singular vectors of the sample covariance matrix, i.e., 
$\mathbf{C} = \frac{1}{m}\mathbf{X}\mathbf{X}^\top$ for the mean-centered data matrix $\mathbf{X}$. Figure 12.2 illustrates the 
basic intuition behind PCA, showing how two-dimensional data points with highly 
correlated features can be more succinctly represented with a one-dimensional 
representation that captures most of the variance in the data. 


Theorem 12.1 

Let $\mathbf{P}^* \in \mathcal{P}_k$ be the PCA solution, i.e., the orthogonal projection matrix solution of 
(12.1). Then, $\mathbf{P}^* = \mathbf{U}_k \mathbf{U}_k^\top$, where $\mathbf{U}_k \in \mathbb{R}^{N \times k}$ is the matrix formed by the top k 
singular vectors of $\mathbf{C} = \frac{1}{m}\mathbf{X}\mathbf{X}^\top$, the sample covariance matrix corresponding to $\mathbf{X}$. 
Moreover, the associated k-dimensional representation of $\mathbf{X}$ is given by $\mathbf{Y} = \mathbf{U}_k^\top \mathbf{X}$. 


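Proof Let $\mathbf{P} = \mathbf{P}^\top$ be an orthogonal projection matrix. By the definition of 
the Frobenius norm, the linearity of the trace operator and the fact that $\mathbf{P}$ is 
idempotent, i.e., $\mathbf{P}^2 = \mathbf{P}$, we observe that 

$$\|\mathbf{P}\mathbf{X} - \mathbf{X}\|_F^2 = \operatorname{Tr}[(\mathbf{P}\mathbf{X} - \mathbf{X})^\top(\mathbf{P}\mathbf{X} - \mathbf{X})] = \operatorname{Tr}[\mathbf{X}^\top\mathbf{P}^2\mathbf{X} - 2\mathbf{X}^\top\mathbf{P}\mathbf{X} + \mathbf{X}^\top\mathbf{X}] = -\operatorname{Tr}[\mathbf{X}^\top\mathbf{P}\mathbf{X}] + \operatorname{Tr}[\mathbf{X}^\top\mathbf{X}].$$

Since $\operatorname{Tr}[\mathbf{X}^\top\mathbf{X}]$ is a constant with respect to $\mathbf{P}$, we have 

$$\operatorname*{argmin}_{\mathbf{P} \in \mathcal{P}_k} \|\mathbf{P}\mathbf{X} - \mathbf{X}\|_F^2 = \operatorname*{argmax}_{\mathbf{P} \in \mathcal{P}_k} \operatorname{Tr}[\mathbf{X}^\top\mathbf{P}\mathbf{X}]. \tag{12.2}$$

By definition of orthogonal projections in $\mathcal{P}_k$, $\mathbf{P} = \mathbf{U}\mathbf{U}^\top$ for some $\mathbf{U} \in \mathbb{R}^{N \times k}$ 
containing orthonormal columns. Using the invariance of the trace operator under 
cyclic permutations and the orthogonality of the columns of $\mathbf{U}$, we have 

$$\operatorname{Tr}[\mathbf{X}^\top\mathbf{P}\mathbf{X}] = \operatorname{Tr}[\mathbf{U}^\top\mathbf{X}\mathbf{X}^\top\mathbf{U}] = \sum_{i=1}^k \mathbf{u}_i^\top\mathbf{X}\mathbf{X}^\top\mathbf{u}_i,$$

where $\mathbf{u}_i$ is the ith column of $\mathbf{U}$. By the Rayleigh quotient (section A.2.3), it is clear 
that the top k singular vectors of $\mathbf{X}\mathbf{X}^\top$ maximize the rightmost sum above. Since 
$\mathbf{X}\mathbf{X}^\top$ and $\mathbf{C}$ differ only by a scaling factor, they have the same singular vectors, 
and thus $\mathbf{U}_k$ maximizes this sum, which proves the first statement of the theorem. 
Finally, since $\mathbf{P}\mathbf{X} = \mathbf{U}_k\mathbf{U}_k^\top\mathbf{X}$, $\mathbf{Y} = \mathbf{U}_k^\top\mathbf{X}$ is a k-dimensional representation of $\mathbf{X}$ 
with $\mathbf{U}_k$ as the basis vectors. 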


By definition of the covariance matrix, the top singular vectors of C are the 
directions of maximal variance in the data, and the associated singular values 
are equal to these variances. Hence, PCA can also be viewed as projecting onto 
the subspace of maximal variance. Under this interpretation, the first principal 
component is derived from projection onto the direction of maximal variance, given 
by the top singular vector of C. Similarly, the ith principal component, for $1 \le i \le k$, 
is derived from projection onto the ith direction of maximal variance, subject to 
orthogonality constraints to the previous i — 1 directions of maximal variance (see 
exercise 12.1 for more details). 
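
As an illustration of theorem 12.1, the following is a minimal NumPy sketch of PCA. The function name `pca` and the toy data are assumptions made for this example; it computes the eigenvectors of the sample covariance matrix, which coincide with its singular vectors since C is symmetric positive semidefinite.

```python
import numpy as np

def pca(X, k):
    """PCA of an N x m data matrix X whose columns are the data points.

    Returns (U_k, Y, top_eigenvalues): U_k has shape (N, k), Y has shape (k, m).
    """
    # Center the columns so that the points sum to zero, as theorem 12.1 assumes.
    Xc = X - X.mean(axis=1, keepdims=True)
    m = Xc.shape[1]
    C = Xc @ Xc.T / m                       # sample covariance matrix (N x N)
    # C is symmetric positive semidefinite, so its singular vectors and values
    # coincide with its eigenvectors and eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    U_k = eigvecs[:, top]
    Y = U_k.T @ Xc                          # k-dimensional representation
    return U_k, Y, eigvals[top]

# Toy data (hypothetical): two strongly correlated features.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.vstack([t, 2.0 * t + 0.05 * rng.normal(size=200)])

U_k, Y, variances = pca(X, k=1)
P = U_k @ U_k.T                             # rank-1 orthogonal projection P*
```

In this sketch the variance of the one-dimensional representation `Y` equals the top eigenvalue of C, in line with the maximal-variance interpretation above.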


12.2 Kernel Principal Component Analysis (KPCA) 


In the previous section, we presented the PCA algorithm, which involved projecting 
onto the singular vectors of the sample covariance matrix C. In this section, we 
present a kernelized version of PCA, called KPCA. In the KPCA setting, $\Phi$ is 
a feature mapping to an arbitrary RKHS (not necessarily to $\mathbb{R}^N$) and we work 
exclusively with a kernel function K corresponding to the inner product in this 
RKHS. The KPCA algorithm can thus be defined as a generalization of PCA in 
which the input data is projected onto the top principal components in this RKHS. 
We will show the relationship between PCA and KPCA by drawing upon the deep 
connections among the SVDs of X, C and K. We then illustrate how various 
manifold learning algorithms can be interpreted as special instances of KPCA. 


Figure 12.2 Example of PCA. (a) Two-dimensional data points with features 
capturing shoe size measured with different units. (b) One-dimensional representation 
(blue squares) that captures the most variance in the data, generated by projecting 
onto the largest principal component (red line) of the mean-centered data points. 

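Let K be a PDS kernel defined over $\mathcal{X} \times \mathcal{X}$ and define the kernel matrix as 
$\mathbf{K} = \mathbf{X}^\top\mathbf{X}$. Since $\mathbf{X}$ admits the following singular value decomposition: 
$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, $\mathbf{C}$ and $\mathbf{K}$ can be rewritten as follows: 

$$\mathbf{C} = \frac{1}{m}\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top, \qquad \mathbf{K} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top, \tag{12.3}$$

where $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^2$ is the diagonal matrix of the singular values of $m\mathbf{C}$ and $\mathbf{U}$ is the 
matrix of the singular vectors of $\mathbf{C}$ (and $m\mathbf{C}$). 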

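Starting with the SVD of $\mathbf{X}$, note that right multiplying by $\mathbf{V}\boldsymbol{\Sigma}^{-1}$ and using the 
relationship between $\boldsymbol{\Lambda}$ and $\boldsymbol{\Sigma}$ yields $\mathbf{U} = \mathbf{X}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}$. Thus, the singular vector $\mathbf{u}$ of 
$\mathbf{C}$ associated to the singular value $\lambda/m$ coincides with $\frac{\mathbf{X}\mathbf{v}}{\sqrt{\lambda}}$, where $\mathbf{v}$ is the singular 
vector of $\mathbf{K}$ associated to $\lambda$. Now fix an arbitrary feature vector $\mathbf{x} = \Phi(x)$ for 
$x \in \mathcal{X}$. Then, following the expression for $\mathbf{Y}$ in theorem 12.1, the one-dimensional 
representation of $\mathbf{x}$ derived by projection onto $\mathbf{P}_u = \mathbf{u}\mathbf{u}^\top$ is defined by 

$$\mathbf{x}^\top\mathbf{u} = \mathbf{x}^\top\frac{\mathbf{X}\mathbf{v}}{\sqrt{\lambda}} = \frac{\mathbf{k}_x^\top\mathbf{v}}{\sqrt{\lambda}}, \tag{12.4}$$

where $\mathbf{k}_x = (K(x_1, x), \ldots, K(x_m, x))^\top$. If $x$ is one of the data points, i.e., $x = x_i$ for 
$1 \le i \le m$, then $\mathbf{k}_x$ is the ith column of $\mathbf{K}$ and (12.4) can be simplified as follows: 

$$\mathbf{x}^\top\mathbf{u} = \frac{\mathbf{k}_x^\top\mathbf{v}}{\sqrt{\lambda}} = \frac{\lambda v_i}{\sqrt{\lambda}} = \sqrt{\lambda}\, v_i, \tag{12.5}$$

where $v_i$ is the ith component of $\mathbf{v}$. More generally, the PCA solution of theorem 12.1 
can be fully defined by the top k singular vectors of $\mathbf{K}$, $\mathbf{v}_1, \ldots, \mathbf{v}_k$, and the 
corresponding singular values. This alternative derivation of the PCA solution in 
terms of $\mathbf{K}$ precisely defines the KPCA solution, providing a generalization of PCA 
via the use of PDS kernels (see chapter 5 for more details on kernel methods). 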
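
The relationship between PCA and KPCA can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `kpca` and `kpca_project` are hypothetical, the kernel is the linear kernel, and the kernel matrix is assumed to be built from mean-centered feature vectors, as in this section. It computes the representation of the training points from K alone and compares it, up to the sign of each component, with the PCA representation.

```python
import numpy as np

def kpca(K, k):
    """Top-k eigenpairs of a kernel matrix K and the training representation.

    Row j of Y holds sqrt(lambda_j) * v_j, the coordinates of equation (12.5).
    K is assumed to come from mean-centered feature vectors.
    """
    lam, V = np.linalg.eigh(K)              # eigenvalues in ascending order
    top = np.argsort(lam)[::-1][:k]
    lam_k, V_k = lam[top], V[:, top]
    Y = np.sqrt(lam_k)[:, None] * V_k.T
    return lam_k, V_k, Y

def kpca_project(k_x, lam_k, V_k):
    """Representation of a point from its kernel vector k_x, equation (12.4)."""
    return (V_k.T @ k_x) / np.sqrt(lam_k)

# Hypothetical data: N = 5 features, m = 40 points, mean-centered columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 40))
X = X - X.mean(axis=1, keepdims=True)

K = X.T @ X                                 # linear kernel matrix (m x m)
lam_k, V_k, Y_kpca = kpca(K, k=2)

# PCA representation Y = U_k^T X obtained from the SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y_pca = U[:, :2].T @ X
```

With a linear kernel the two representations agree up to sign, and applying `kpca_project` to a column of K reproduces the corresponding column of `Y_kpca`. For a non-linear kernel, only `K` changes.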


12.3 KPCA and manifold learning 


Several manifold learning techniques have been proposed as non-linear methods for 
dimensionality reduction. These algorithms implicitly assume that high-dimensional 
data lie on or near a low-dimensional non-linear manifold embedded in the input 
space. They aim to learn this manifold structure by finding a low-dimensional 
space that in some way preserves the local structure of high-dimensional input 
data. For instance, the Isomap algorithm aims to preserve approximate geodesic 
distances, or distances along the manifold, between all pairs of data points. Other 
algorithms, such as Laplacian eigenmaps and locally linear embedding, focus only 
on preserving local neighborhood relationships in the high-dimensional space. We 
will next describe these classical manifold learning algorithms and then interpret 
them as specific instances of KPCA. 


12.3.1 Isomap 


Isomap aims to extract a low-dimensional data representation that best preserves 
all pairwise distances between input points, as measured by their geodesic distances 
along the underlying manifold. It approximates geodesic distance assuming that $L_2$ 
distance provides good approximations for nearby points, and for faraway points 
it estimates distance as a series of hops between neighboring points. The Isomap 
algorithm works as follows: 


1. Find the t nearest neighbors for each data point based on $L_2$ distance and 
construct an undirected neighborhood graph, denoted by G, with points as nodes 
and links between neighbors as edges. 

2. Compute the approximate geodesic distances, $\Delta_{ij}$, between all pairs of nodes 
$(i, j)$ by computing all-pairs shortest distances in G using, for instance, the 
Floyd-Warshall algorithm. 

3. Convert the squared distance matrix into an $m \times m$ similarity matrix by performing 
double centering, i.e., compute $\mathbf{K}_{\text{Iso}} = -\frac{1}{2}\mathbf{H}\boldsymbol{\Delta}\mathbf{H}$, where $\boldsymbol{\Delta}$ is the squared distance 
matrix, $\mathbf{H} = \mathbf{I}_m - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, $\mathbf{I}_m$ is the $m \times m$ identity matrix 
and $\mathbf{1}$ is a column vector of all ones (for more details on double centering see 
exercise 12.2). 

4. Find the optimal k-dimensional representation, $\mathbf{Y} = \{\mathbf{y}_i\}_{i=1}^m$, such that 
$\mathbf{Y} = \operatorname{argmin}_{\mathbf{Y}'} \sum_{i,j} \big(\|\mathbf{y}'_i - \mathbf{y}'_j\|_2^2 - \Delta_{ij}^2\big)$. The solution is given by 

$$\mathbf{Y} = (\boldsymbol{\Sigma}_{\text{Iso},k})^{1/2}\, \mathbf{U}_{\text{Iso},k}^\top, \tag{12.6}$$

where $\boldsymbol{\Sigma}_{\text{Iso},k}$ is the diagonal matrix of the top k singular values of $\mathbf{K}_{\text{Iso}}$ and $\mathbf{U}_{\text{Iso},k}$ 
are the associated singular vectors. 


$\mathbf{K}_{\text{Iso}}$ can naturally be viewed as a kernel matrix, thus providing a simple connection 
between Isomap and KPCA. Note, however, that this interpretation is valid only 
when $\mathbf{K}_{\text{Iso}}$ is in fact positive semidefinite, which is indeed the case in the continuum 
limit for a smooth manifold. 
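
The four steps of Isomap can be sketched in a few lines of NumPy. This is a minimal illustration rather than an efficient implementation: the function name `isomap` and the toy data (points on a half circle) are assumptions made here, pairwise distances are computed densely, and negative eigenvalues of the double-centered matrix are clipped to zero, which is a choice of this sketch.

```python
import numpy as np

def isomap(X, t, k):
    """Isomap embedding of the columns of an N x m matrix X; returns a k x m matrix."""
    m = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    dist = np.sqrt((diff ** 2).sum(axis=0))          # pairwise L2 distances
    # Step 1: undirected t-nearest-neighbor graph (inf marks a missing edge).
    G = np.full((m, m), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(dist, axis=1)[:, 1:t + 1]        # column 0 is the point itself
    for i in range(m):
        G[i, nn[i]] = dist[i, nn[i]]
        G[nn[i], i] = dist[nn[i], i]
    # Step 2: all-pairs shortest paths (Floyd-Warshall, one intermediate node per pass).
    for r in range(m):
        G = np.minimum(G, G[:, [r]] + G[[r], :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is not connected; increase t")
    # Step 3: double centering of the squared geodesic distances.
    H = np.eye(m) - np.ones((m, m)) / m
    K_iso = -0.5 * H @ (G ** 2) @ H
    # Step 4: top-k eigenpairs of K_iso, as in equation (12.6).
    lam, U = np.linalg.eigh(K_iso)
    top = np.argsort(lam)[::-1][:k]
    lam_k = np.clip(lam[top], 0.0, None)             # sketch choice: drop negative values
    return np.sqrt(lam_k)[:, None] * U[:, top].T

# Toy data (hypothetical): 30 points on a half circle, a one-dimensional manifold.
theta = np.linspace(0.0, np.pi, 30)
X = np.vstack([np.cos(theta), np.sin(theta)])
Y = isomap(X, t=2, k=1)
```

On this example the one-dimensional embedding is, up to sign and translation, very close to the arc length along the half circle, which is the geodesic coordinate that Isomap aims to recover.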


12.3.2 Laplacian eigenmaps 


The Laplacian eigenmaps algorithm aims to find a low-dimensional representation 
that best preserves neighborhood relations as measured by a weight matrix W. The 
algorithm works as follows: 


1. Find t nearest neighbors for each point. 

2. Construct $\mathbf{W}$, a sparse, symmetric $m \times m$ matrix, where 
$W_{ij} = \exp\big(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2/\sigma^2\big)$ if $(\mathbf{x}_i, \mathbf{x}_j)$ are neighbors, 0 otherwise, and $\sigma$ is a scaling parameter. 

3. Construct the diagonal matrix $\mathbf{D}$, such that $D_{ii} = \sum_j W_{ij}$. 

4. Find the k-dimensional representation by minimizing the weighted distance 
between neighbors as, 

$$\mathbf{Y} = \operatorname*{argmin}_{\mathbf{Y}'} \sum_{i,j} W_{ij}\|\mathbf{y}'_i - \mathbf{y}'_j\|_2^2. \tag{12.7}$$

This objective function penalizes nearby inputs for being mapped to faraway 
outputs, with “nearness” measured by the weight matrix $\mathbf{W}$. The solution to the 
minimization in (12.7) is $\mathbf{Y} = \mathbf{U}_{L,k}^\top$, where $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the graph Laplacian 
and $\mathbf{U}_{L,k}$ are the bottom k singular vectors of $\mathbf{L}$, excluding the last singular vector 
corresponding to the singular value 0 (assuming that the underlying neighborhood 
graph is connected). 


The solution to (12.7) can also be interpreted as finding the largest singular 
vectors of $\mathbf{L}^\dagger$, the pseudo-inverse of $\mathbf{L}$. Defining $\mathbf{K}_L = \mathbf{L}^\dagger$ we can thus view Laplacian 
Eigenmaps as an instance of KPCA in which the output dimensions are normalized 
to have unit variance, which corresponds to setting $\lambda = 1$ in (12.5). Moreover, it 
can be shown that $\mathbf{K}_L$ is the kernel matrix associated with the commute times of 
diffusion on the underlying neighborhood graph, where the commute time between 
nodes i and j in a graph is the expected time taken for a random walk to start at 
node i, reach node j and then return to i. 
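
The steps of Laplacian eigenmaps can be sketched as follows. The function name `laplacian_eigenmaps`, the symmetrization rule (two points are linked if either is among the t nearest neighbors of the other) and the toy data are assumptions of this sketch; the neighborhood graph is assumed to be connected, so that the eigenvalue 0 of L is simple.

```python
import numpy as np

def laplacian_eigenmaps(X, t, k, sigma=1.0):
    """Embed the columns of an N x m matrix X; returns (Y, W, L) with Y of shape k x m."""
    m = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    sq = (diff ** 2).sum(axis=0)                     # squared L2 distances
    nn = np.argsort(sq, axis=1)[:, 1:t + 1]          # t nearest neighbors of each point
    W = np.zeros((m, m))
    for i in range(m):
        W[i, nn[i]] = np.exp(-sq[i, nn[i]] / sigma ** 2)
    W = np.maximum(W, W.T)                           # make the weight matrix symmetric
    D = np.diag(W.sum(axis=1))
    L = D - W                                        # graph Laplacian
    lam, U = np.linalg.eigh(L)                       # eigenvalues in ascending order
    # Skip the eigenvector of eigenvalue 0 (the constant vector) and keep the next k.
    Y = U[:, 1:k + 1].T
    return Y, W, L

# Toy data (hypothetical): 30 points on a half circle.
theta = np.linspace(0.0, np.pi, 30)
X = np.vstack([np.cos(theta), np.sin(theta)])
Y, W, L = laplacian_eigenmaps(X, t=2, k=1)

# The objective of (12.7) for a one-dimensional y equals 2 * y^T L y.
y = Y[0]
objective = np.sum(W * (y[:, None] - y[None, :]) ** 2)
```

The last two lines evaluate the objective of (12.7) directly, which can be compared with the quadratic form in the graph Laplacian discussed in exercise 12.3.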


12.3.3 Locally linear embedding (LLE) 


The Locally linear embedding (LLE) algorithm also aims to find a low-dimensional 
representation that preserves neighborhood relations as measured by a weight 
matrix W. The algorithm works as follows: 


1. Find t nearest neighbors for each point. 


2. Construct $\mathbf{W}$, a sparse $m \times m$ matrix, whose ith row sums to one and 
contains the linear coefficients that optimally reconstruct $\mathbf{x}_i$ from its t neighbors. 
More specifically, if we assume that the ith row of $\mathbf{W}$ sums to one, then the 
reconstruction error is 

$$\Big(\mathbf{x}_i - \sum_{j \in N_i} W_{ij}\mathbf{x}_j\Big)^2 = \Big(\sum_{j \in N_i} W_{ij}(\mathbf{x}_i - \mathbf{x}_j)\Big)^2 = \sum_{j,k \in N_i} W_{ij}W_{ik}C'_{jk}, \tag{12.8}$$

where $N_i$ is the set of indices of the neighbors of point $\mathbf{x}_i$ and 
$C'_{jk} = (\mathbf{x}_i - \mathbf{x}_j)^\top(\mathbf{x}_i - \mathbf{x}_k)$ the local covariance matrix. Minimizing this expression with the constraint 
$\sum_j W_{ij} = 1$ gives the solution 

$$W_{ij} = \frac{\sum_k (C'^{-1})_{jk}}{\sum_{s,u} (C'^{-1})_{su}}. \tag{12.9}$$

Note that the solution can be equivalently obtained by first solving the system of 
linear equations $\sum_j C'_{kj}W_{ij} = 1$, for $k \in N_i$, and then normalizing so that the 
weights sum to one. 


3. Find the k-dimensional representation that best obeys neighborhood relations as 
specified by $\mathbf{W}$, i.e., 

$$\mathbf{Y} = \operatorname*{argmin}_{\mathbf{Y}'} \sum_i \Big(\mathbf{y}'_i - \sum_j W_{ij}\mathbf{y}'_j\Big)^2. \tag{12.10}$$

The solution to the minimization in (12.10) is $\mathbf{Y} = \mathbf{U}_{M,k}^\top$, where 
$\mathbf{M} = (\mathbf{I} - \mathbf{W}^\top)(\mathbf{I} - \mathbf{W}^\top)^\top$ and $\mathbf{U}_{M,k}$ are the bottom k singular vectors of $\mathbf{M}$, excluding the last singular 
vector corresponding to the singular value 0. 


As discussed in exercise 12.5, LLE coincides with KPCA used with a particular 
kernel matrix $\mathbf{K}_{\text{LLE}}$ whereby the output dimensions are normalized to have unit 
variance (as in the case of Laplacian Eigenmaps). 
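
The weight construction of LLE can be sketched as follows, using the linear system mentioned in step 2 followed by normalization. The function names and the toy data are hypothetical. The small ridge term added to the local covariance matrix is an assumption of this sketch, not part of the description above: it keeps the system solvable when the local covariance matrix is singular, for instance when t exceeds the input dimension or when the neighbors are collinear.

```python
import numpy as np

def lle_weights(X, t, reg=1e-3):
    """Reconstruction weights for the columns of an N x m matrix X (rows of W sum to one)."""
    m = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    sq = (diff ** 2).sum(axis=0)
    nn = np.argsort(sq, axis=1)[:, 1:t + 1]          # t nearest neighbors of each point
    W = np.zeros((m, m))
    for i in range(m):
        Z = X[:, [i]] - X[:, nn[i]]                  # columns are x_i - x_j, j in N_i
        C = Z.T @ Z                                  # local covariance matrix C'
        C = C + reg * np.trace(C) * np.eye(t)        # ridge term: assumption of this sketch
        w = np.linalg.solve(C, np.ones(t))           # solve sum_j C'_kj w_j = 1
        W[i, nn[i]] = w / w.sum()                    # normalize the weights to sum to one
    return W

def lle_embed(W, k):
    """Bottom-k eigenvectors of M = (I - W^T)(I - W^T)^T, skipping the first one."""
    m = W.shape[0]
    A = np.eye(m) - W.T
    M = A @ A.T
    lam, U = np.linalg.eigh(M)                       # eigenvalues in ascending order
    return U[:, 1:k + 1].T

# Toy data (hypothetical): 12 equally spaced points on a line in R^3.
s = np.arange(12.0)
X = np.vstack([s, 2.0 * s, -s])
W = lle_weights(X, t=2)
Y = lle_embed(W, k=1)
```

In this example every interior point is the midpoint of its two neighbors, so its two weights are both equal to 1/2.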


12.4 Johnson-Lindenstrauss lemma 


The Johnson-Lindenstrauss lemma is a fundamental result in dimensionality reduction 
that states that any m points in high-dimensional space can be mapped to a 
much lower dimension, $k \ge O\big(\frac{\log m}{\epsilon^2}\big)$, without distorting pairwise distance between 
any two points by more than a factor of $(1 \pm \epsilon)$. In fact, such a mapping can be 
found in randomized polynomial time by projecting the high-dimensional points 
onto randomly chosen k-dimensional linear subspaces. The Johnson-Lindenstrauss 
lemma is formally presented in lemma 12.3. The proof of this lemma hinges on 
lemma 12.1 and lemma 12.2, and it is an example of the “probabilistic method”, 
in which probabilistic arguments lead to a deterministic statement. Moreover, as 
we will see, the Johnson-Lindenstrauss lemma follows by showing that the squared 
length of a random vector is sharply concentrated around its mean when the vector 
is projected onto a k-dimensional random subspace. 

First, we prove the following property of the $\chi^2$ distribution (see definition C.6 
in appendix), which will be used in lemma 12.2. 


Lemma 12.1 
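Let $Q$ be a random variable following a $\chi^2$ distribution with $k$ degrees of 
freedom. Then, for any $0 < \epsilon < 1/2$, the following inequality holds: 

$$\Pr[(1 - \epsilon)k \le Q \le (1 + \epsilon)k] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}. \tag{12.11}$$


Proof By Markov’s inequality, we can write 

$$\Pr[Q \ge (1 + \epsilon)k] = \Pr[\exp(\lambda Q) \ge \exp(\lambda(1 + \epsilon)k)] \le \frac{\mathrm{E}[\exp(\lambda Q)]}{\exp(\lambda(1 + \epsilon)k)} = \frac{(1 - 2\lambda)^{-k/2}}{\exp(\lambda(1 + \epsilon)k)},$$

where we used for the final equality the expression of the moment-generating 
function of a $\chi^2$ distribution, $\mathrm{E}[\exp(\lambda Q)]$, for $\lambda < 1/2$ (equation C.14). 
Choosing $\lambda = \frac{\epsilon}{2(1 + \epsilon)} < 1/2$, which minimizes the right-hand side of the final 
equality, and using the inequality $1 + \epsilon \le \exp(\epsilon - (\epsilon^2 - \epsilon^3)/2)$ yield 

$$\Pr[Q \ge (1 + \epsilon)k] \le \Big(\frac{1 + \epsilon}{\exp(\epsilon)}\Big)^{k/2} \le \Big(\frac{\exp\big(\epsilon - \frac{\epsilon^2 - \epsilon^3}{2}\big)}{\exp(\epsilon)}\Big)^{k/2} = \exp\Big(-\frac{k}{4}(\epsilon^2 - \epsilon^3)\Big).$$

The statement of the lemma follows by using similar techniques to bound 
$\Pr[Q \le (1 - \epsilon)k]$ and by applying the union bound. ∎ 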


Lemma 12.2 

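Let $\mathbf{x} \in \mathbb{R}^N$, define $k < N$ and assume that entries in $\mathbf{A} \in \mathbb{R}^{k \times N}$ are sampled 
independently from the standard normal distribution, $N(0, 1)$. Then, for any 
$0 < \epsilon < 1/2$, 

$$\Pr\Big[(1 - \epsilon)\|\mathbf{x}\|^2 \le \Big\|\frac{1}{\sqrt{k}}\mathbf{A}\mathbf{x}\Big\|^2 \le (1 + \epsilon)\|\mathbf{x}\|^2\Big] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}. \tag{12.12}$$


Proof Let $\hat{\mathbf{x}} = \mathbf{A}\mathbf{x}$ and observe that 

$$\mathrm{E}[\hat{x}_j^2] = \mathrm{E}\Big[\Big(\sum_{i=1}^N A_{ji}x_i\Big)^2\Big] = \mathrm{E}\Big[\sum_{i=1}^N A_{ji}^2 x_i^2\Big] = \sum_{i=1}^N x_i^2 = \|\mathbf{x}\|^2.$$

The second and third equalities follow from the independence and unit variance, 
respectively, of the $A_{ij}$. Now, define $T_j = \hat{x}_j/\|\mathbf{x}\|$ and note that the $T_j$s are 
independent standard normal random variables since the $A_{ij}$ are i.i.d. standard 
normal random variables and $\mathrm{E}[\hat{x}_j^2] = \|\mathbf{x}\|^2$. Thus, the variable $Q$ defined by 
$Q = \sum_{j=1}^k T_j^2$ follows a $\chi^2$ distribution with $k$ degrees of freedom and 
we have 

$$\Pr\Big[(1 - \epsilon)\|\mathbf{x}\|^2 \le \frac{\|\hat{\mathbf{x}}\|^2}{k} \le (1 + \epsilon)\|\mathbf{x}\|^2\Big] = \Pr\Big[(1 - \epsilon)k \le \sum_{j=1}^k T_j^2 \le (1 + \epsilon)k\Big] = \Pr[(1 - \epsilon)k \le Q \le (1 + \epsilon)k] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4},$$

where the final inequality holds by lemma 12.1, thus proving the statement of the 
lemma. 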


Lemma 12.3 Johnson-Lindenstrauss 
For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then for any set $V$ of 
$m$ points in $\mathbb{R}^N$, there exists a map $f\colon \mathbb{R}^N \to \mathbb{R}^k$ such that for all $\mathbf{u}, \mathbf{v} \in V$, 

$$(1 - \epsilon)\|\mathbf{u} - \mathbf{v}\|^2 \le \|f(\mathbf{u}) - f(\mathbf{v})\|^2 \le (1 + \epsilon)\|\mathbf{u} - \mathbf{v}\|^2. \tag{12.13}$$


Proof Let $f = \frac{1}{\sqrt{k}}\mathbf{A}$ where $k < N$ and entries in $\mathbf{A} \in \mathbb{R}^{k \times N}$ are sampled 
independently from the standard normal distribution, $N(0, 1)$. For fixed $\mathbf{u}, \mathbf{v} \in V$, 
we can apply lemma 12.2, with $\mathbf{x} = \mathbf{u} - \mathbf{v}$, to lower bound the success probability 
by $1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}$. Applying the union bound over the $O(m^2)$ pairs in $V$, setting 
$k = \frac{20}{\epsilon^2}\log m$ and upper bounding $\epsilon$ by 1/2, we have 

$$\Pr[\text{success}] \ge 1 - 2m^2 e^{-(\epsilon^2 - \epsilon^3)k/4} = 1 - 2m^{5\epsilon - 3} > 1 - 2m^{-1/2} > 0.$$

Since the success probability is strictly greater than zero, a map that satisfies the 
desired conditions must exist, thus proving the statement of the lemma. ∎ 
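
The construction used in the proof is easy to try numerically. The sketch below draws a Gaussian matrix A, applies the map f = A/√k with k = ⌈20 log m/ε²⌉, and measures the distortion of all pairwise squared distances. The function names, the random seed and the choice of m, N and ε are assumptions made for this illustration, and a single random draw is of course only an empirical check, not a proof.

```python
import math
import numpy as np

def jl_project(V, eps, rng):
    """Random projection of the columns of an N x m matrix V, as in lemma 12.3."""
    N, m = V.shape
    k = math.ceil(20 * math.log(m) / eps ** 2)
    A = rng.standard_normal((k, N))          # i.i.d. N(0, 1) entries
    return (A @ V) / math.sqrt(k), k

def pairwise_sq_dists(P):
    """Matrix of squared L2 distances between the columns of P."""
    gram = P.T @ P
    sq = np.diag(gram)
    return sq[:, None] + sq[None, :] - 2.0 * gram

rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 30))          # m = 30 hypothetical points in R^1000
eps = 0.4
F, k = jl_project(V, eps, rng)

# Ratio of projected to original squared distances over all distinct pairs.
iu = np.triu_indices(V.shape[1], k=1)
ratio = pairwise_sq_dists(F)[iu] / pairwise_sq_dists(V)[iu]
```

Here `ratio.min()` and `ratio.max()` can be compared with 1 − ε and 1 + ε. Note that k depends only on m and ε, not on the original dimension N.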


12.5 Chapter notes 


PCA was introduced in the early 1900s by Pearson [1901]. KPCA was introduced 
roughly a century later, and our presentation of KPCA is a more concise derivation 
of results given by Mika et al. [1999]. Isomap and LLE were pioneering works on 
non-linear dimensionality reduction introduced by Tenenbaum et al. [2000], Roweis 
and Saul [2000]. Isomap itself is a generalization of a standard linear dimensionality 
reduction technique called Multidimensional Scaling [Cox and Cox, 2000]. Isomap 
and LLE led to the development of several related algorithms for manifold learning, 
e.g., Laplacian Eigenmaps and Maximum Variance Unfolding [Belkin and Niyogi, 
2001, Weinberger and Saul, 2006]. As shown in this chapter, classical manifold 
learning algorithms are special instances of KPCA [Ham et al., 2004]. The Johnson- 
Lindenstrauss lemma was introduced by Johnson and Lindenstrauss [1984], though 
our proof of the lemma follows Vempala [2004]. Other simplified proofs of this lemma 
have also been presented, including Dasgupta and Gupta [2003]. 


12.6 Exercises 


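12.1 PCA and maximal variance. Let $\mathbf{X}$ be an uncentered data matrix and let 
$\bar{\mathbf{x}} = \frac{1}{m}\sum_i \mathbf{x}_i$ be the sample mean of the columns of $\mathbf{X}$. 

(a) Show that the variance of one-dimensional projections of the data onto an 
arbitrary vector $\mathbf{u}$ equals $\mathbf{u}^\top\mathbf{C}\mathbf{u}$, where $\mathbf{C} = \frac{1}{m}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top$ is the 
sample covariance matrix. 


(b) Show that PCA with $k = 1$ projects the data onto the direction (i.e., 
$\mathbf{u}^\top\mathbf{u} = 1$) of maximal variance. 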


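12.2 Double centering. In this problem we will prove the correctness of the double 
centering step in Isomap when working with Euclidean distances. Define $\mathbf{X}$ and $\bar{\mathbf{x}}$ as 
in exercise 12.1, and define $\mathbf{X}^*$ as the centered version of $\mathbf{X}$, that is, let $\mathbf{x}_i^* = \mathbf{x}_i - \bar{\mathbf{x}}$ 
be the ith column of $\mathbf{X}^*$. Let $\mathbf{K} = \mathbf{X}^\top\mathbf{X}$, and let $\mathbf{D}$ denote the Euclidean distance 
matrix, i.e., $D_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|$. 


(a) Show that $K_{ij} = \frac{1}{2}(K_{ii} + K_{jj} - D_{ij}^2)$. 

(b) Show that $\mathbf{K}^* = \mathbf{X}^{*\top}\mathbf{X}^* = \mathbf{K} - \frac{1}{m}\mathbf{K}\mathbf{1}\mathbf{1}^\top - \frac{1}{m}\mathbf{1}\mathbf{1}^\top\mathbf{K} + \frac{1}{m^2}\mathbf{1}\mathbf{1}^\top\mathbf{K}\mathbf{1}\mathbf{1}^\top$. 

(c) Using the results from (a) and (b) show that 

$$K^*_{ij} = -\frac{1}{2}\Big[D_{ij}^2 - \frac{1}{m}\sum_{k=1}^m D_{ik}^2 - \frac{1}{m}\sum_{k=1}^m D_{kj}^2 + \bar{D}\Big],$$

where $\bar{D} = \frac{1}{m^2}\sum_u \sum_v D_{u,v}^2$ is the mean of the $m^2$ squared entries of $\mathbf{D}$. 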


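(d) Show that $\mathbf{K}^* = -\frac{1}{2}\mathbf{H}\mathbf{D}^{(2)}\mathbf{H}$, where $\mathbf{D}^{(2)}$ denotes the matrix of squared 
distances $D_{ij}^2$ and $\mathbf{H} = \mathbf{I}_m - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ is the centering matrix of section 12.3.1. 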


12.3 Laplacian eigenmaps. Assume $k = 1$ and we seek a one-dimensional 
representation $\mathbf{y}$. Show that (12.7) is equivalent to $\mathbf{y} = \operatorname{argmin}_{\mathbf{y}'} \mathbf{y}'^\top\mathbf{L}\mathbf{y}'$, where $\mathbf{L}$ is the 
graph Laplacian. 


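12.4 Nyström method. Define the following block representation of a kernel matrix: 

$$\mathbf{K} = \begin{bmatrix} \mathbf{W} & \mathbf{K}_{21}^\top \\ \mathbf{K}_{21} & \mathbf{K}_{22} \end{bmatrix} \quad \text{and} \quad \mathbf{C} = \begin{bmatrix} \mathbf{W} \\ \mathbf{K}_{21} \end{bmatrix}.$$

The Nyström method uses $\mathbf{W} \in \mathbb{R}^{l \times l}$ and $\mathbf{C} \in \mathbb{R}^{m \times l}$ to generate the approximation 
$\widetilde{\mathbf{K}} = \mathbf{C}\mathbf{W}^\dagger\mathbf{C}^\top \approx \mathbf{K}$. 


(a) Show that $\mathbf{W}$ is SPSD and that $\|\mathbf{K} - \widetilde{\mathbf{K}}\|_F = \|\mathbf{K}_{22} - \mathbf{K}_{21}\mathbf{W}^\dagger\mathbf{K}_{21}^\top\|_F$. 

(b) Let $\mathbf{K} = \mathbf{X}^\top\mathbf{X}$ for some $\mathbf{X} \in \mathbb{R}^{N \times m}$, and let $\mathbf{X}' \in \mathbb{R}^{N \times l}$ be the first 
$l$ columns of $\mathbf{X}$. Show that $\widetilde{\mathbf{K}} = \mathbf{X}^\top\mathbf{P}_{U_{X'}}\mathbf{X}$, where $\mathbf{P}_{U_{X'}}$ is the orthogonal 
projection onto the span of the left singular vectors of $\mathbf{X}'$. 

(c) Is $\widetilde{\mathbf{K}}$ SPSD? 

(d) If $\operatorname{rank}(\mathbf{K}) = \operatorname{rank}(\mathbf{W}) = r < m$, show that $\widetilde{\mathbf{K}} = \mathbf{K}$. Note: this statement 
holds whenever $\operatorname{rank}(\mathbf{K}) = \operatorname{rank}(\mathbf{W})$, but is of interest mainly in the low-rank 
setting. 

(e) If $m = 20\text{M}$ and $\mathbf{K}$ is a dense matrix, how much space is required to store $\mathbf{K}$ 
if each entry is stored as a double? How much space is required by the Nyström 
method if $l = 10\text{K}$? 


12.5 Expression for $\mathbf{K}_{\text{LLE}}$. Show the connection between LLE and KPCA by 
deriving the expression for $\mathbf{K}_{\text{LLE}}$. 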


12.6 Random projection, PCA, and nearest neighbors. 


(a) Download the MNIST test set of handwritten digits at: 
http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz. 


Create a data matrix $\mathbf{X} \in \mathbb{R}^{N \times m}$ from the first $m = 2{,}000$ instances of this 
dataset (the dimension of each instance should be $N = 784$). 


(b) Find the ten nearest neighbors for each point in $\mathbf{X}$, that is, compute $N_{i,10}$ 
for $1 \le i \le m$, where $N_{i,t}$ denotes the set of the t nearest neighbors for the ith 
datapoint and nearest neighbors are defined with respect to the $L_2$ norm. Also 
compute $N_{i,50}$ for all i. 


(c) Generate $\widetilde{\mathbf{X}} = \mathbf{A}\mathbf{X}$, where $\mathbf{A} \in \mathbb{R}^{k \times N}$, $k = 100$ and entries of $\mathbf{A}$ are 
sampled independently from the standard normal distribution. Find the ten 
nearest neighbors for each point in $\widetilde{\mathbf{X}}$, that is, compute $\widetilde{N}_{i,10}$ for $1 \le i \le m$. 


(d) Report the quality of approximation by computing 
$\text{score}_{10} = \frac{1}{m}\sum_{i=1}^m |N_{i,10} \cap \widetilde{N}_{i,10}|$. Similarly, compute 
$\text{score}_{50} = \frac{1}{m}\sum_{i=1}^m |N_{i,50} \cap \widetilde{N}_{i,10}|$. 

(e) Generate two plots that show $\text{score}_{10}$ and $\text{score}_{50}$ as functions of $k$ (i.e., 
perform steps (c) and (d) for $k \in \{1, 10, 50, 100, 250, 500\}$). Provide a one- or 
two-sentence explanation of these plots. 

(f) Generate similar plots as in (e) using PCA (with various values of $k$) to 
generate $\widetilde{\mathbf{X}}$ and subsequently compute nearest neighbors. Are the nearest neighbor 
approximations generated via PCA better or worse than those generated via 
random projections? Explain why. 


13 Learning Automata and Languages 


This chapter presents an introduction to the problem of learning languages. This 
is a classical problem explored since the early days of formal language theory and 
computer science, and there is a very large body of literature dealing with related 
mathematical questions. In this chapter, we present a brief introduction to this 
problem and concentrate specifically on the question of learning finite automata, 
which, by itself, has been a topic investigated in multiple forms by thousands of 
technical papers. We will examine two broad frameworks for learning automata, 
and for each, we will present an algorithm. In particular, we describe an algorithm 
for learning automata in which the learner has access to several types of query, and 
we discuss an algorithm for identifying a sub-class of the family of automata in the 
limit. 


13.1 Introduction 


Learning languages is one of the earliest problems discussed in linguistics and 
computer science. It has been prompted by the remarkable faculty of humans to 
learn natural languages. Humans are capable of uttering well-formed new sentences 
at an early age, after having been exposed only to finitely many sentences. Moreover, 
even at an early age, they can make accurate judgments of grammaticality for new 
sentences. 

In computer science, the problem of learning languages is directly related to that 
of learning the representation of the computational device generating a language. 
Thus, for example, learning regular languages is equivalent to learning finite au- 
tomata, or learning context-free languages or context-free grammars is equivalent 
to learning pushdown automata. 

There are several reasons for examining specifically the problem of learning 
finite automata. Automata provide natural modeling representations in a variety 
of different domains including systems, networking, image processing, text and 
speech processing, logic and many others. Automata can also serve as simple or 
efficient approximations for more complex devices. For example, in natural language 
processing, they can be used to approximate context-free languages. When it is 
possible, learning automata is often efficient, though, as we shall see, the problem 
is hard in a number of natural scenarios. Thus, learning more complex devices or 
languages is even harder. 


Figure 13.1 (a) A graphical representation of a finite automaton. (b) Equivalent 
(minimal) deterministic automaton. 

We consider two general learning frameworks: the model of efficient exact learning 
and the model of identification in the limit. For each of these models, we briefly 
discuss the problem of learning automata and describe an algorithm. 

We first give a brief review of some basic automata definitions and algorithms, 
then discuss the problem of efficient exact learning of automata and that of the 
identification in the limit. 


13.2 Finite automata 


We will denote by $\Sigma$ a finite alphabet. The length of a string $x \in \Sigma^*$ over that 
alphabet is denoted by $|x|$. The empty string is denoted by $\epsilon$, thus $|\epsilon| = 0$. For any 
string $x = x_1 \cdots x_k \in \Sigma^*$ of length $k \ge 0$, we denote by $x[j] = x_1 \cdots x_j$ its prefix of 
length $j \le k$ and define $x[0]$ as $\epsilon$. 

Finite automata are labeled directed graphs equipped with initial and final states. 
The following gives a formal definition of these devices. 


Definition 13.1 Finite automata 

A finite automaton $A$ is a 5-tuple $(\Sigma, Q, I, F, E)$ where $\Sigma$ is a finite alphabet, $Q$ a 
finite set of states, $I \subseteq Q$ a set of initial states, $F \subseteq Q$ a set of final states, and 
$E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times Q$ a finite set of transitions. 


Figure 13.1a shows a simple example of a finite automaton. States are represented 
by circles. A bold circle indicates an initial state, a double circle a final state. Each 
transition is represented by an arrow from its origin state to its destination state 
with its label in $\Sigma \cup \{\epsilon\}$. 

A path from an initial state to a final state is said to be an accepting path. An 
automaton is said to be trim if all of its states are accessible from an initial state 
and admit a path to a final state, that is, if all of its states lie on an accepting 
path. A string $x \in \Sigma^*$ is accepted by an automaton $A$ iff $x$ labels an accepting path. 
For convenience, we will say that $x \in \Sigma^*$ is rejected by $A$ when it is not accepted. 
The set of all strings accepted by $A$ defines the language accepted by $A$ denoted by 
$L(A)$. The class of languages accepted by finite automata coincides with the family 
of regular languages, that is, languages that can be described by regular expressions. 

Any finite automaton admits an equivalent automaton with no $\epsilon$-transition, that 
is, no transition labeled with the empty string: there exists a general $\epsilon$-removal 
algorithm that takes as input an automaton and returns an equivalent automaton 
with no $\epsilon$-transition. 

An automaton with no $\epsilon$-transition is said to be deterministic if it admits a unique 
initial state and if no two transitions sharing the same label leave any given state. 
A deterministic finite automaton is often referred to by the acronym DFA, while 
the acronym NFA is used for arbitrary automata, that is, non-deterministic finite 
automata. Any NFA admits an equivalent DFA: there exists a general (exponential-time) 
determinization algorithm that takes as input an NFA with no $\epsilon$-transition 
and returns an equivalent DFA. Thus, the class of languages accepted by DFAs 
coincides with that of the languages accepted by NFAs, that is regular languages. 
For any string $x \in \Sigma^*$ and DFA $A$, we denote by $A(x)$ the state reached in $A$ when 
reading $x$ from its unique initial state. 

A DFA is said to be minimal if it admits no equivalent deterministic automaton 
with a smaller number of states. There exists a general minimization algorithm 
taking as input a deterministic automaton and returning a minimal one that runs 
in $O(|E| \log |Q|)$. When the input DFA is acyclic, that is when it admits no path 
forming a cycle, it can be minimized in linear time $O(|Q| + |E|)$. Figure 13.1b shows 
the minimal DFA equivalent to the NFA of figure 13.1a. 
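
To make these definitions concrete, here is a small sketch of acceptance by an NFA with no ε-transition and of determinization by the classical subset construction. The representation (transitions as triples, DFA states as frozensets) and the example automaton, which accepts the strings over {a, b} ending in "ab", are assumptions made for this illustration; the example is not the automaton of figure 13.1.

```python
from itertools import product

def nfa_accepts(E, I, F, x):
    """True iff the string x labels a path from a state in I to a state in F.

    E is a set of transitions (p, a, q) with no epsilon-transitions.
    """
    current = set(I)
    for a in x:
        current = {q for (p, b, q) in E if p in current and b == a}
    return bool(current & set(F))

def determinize(E, I, F, alphabet):
    """Subset construction: returns (delta, start, finals) of an equivalent DFA.

    DFA states are frozensets of NFA states; delta maps (state, symbol) to a state.
    The number of DFA states can be exponential in the worst case.
    """
    start = frozenset(I)
    delta, finals = {}, set()
    seen, stack = {start}, [start]
    while stack:
        S = stack.pop()
        if S & set(F):
            finals.add(S)
        for a in alphabet:
            T = frozenset(q for (p, b, q) in E if p in S and b == a)
            if not T:
                continue                  # no transition: the DFA stays partial
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                stack.append(T)
    return delta, start, finals

def dfa_accepts(delta, start, finals, x):
    """Read x from the unique initial state; a missing transition means rejection."""
    state = start
    for a in x:
        state = delta.get((state, a))
        if state is None:
            return False
    return state in finals

# Hypothetical NFA over {a, b} accepting the strings that end in "ab".
E = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)}
I, F, alphabet = {0}, {2}, "ab"
delta, start, finals = determinize(E, I, F, alphabet)

# Compare both automata on all strings of length at most 5.
strings = ["".join(w) for n in range(6) for w in product(alphabet, repeat=n)]
agree = all(nfa_accepts(E, I, F, x) == dfa_accepts(delta, start, finals, x) for x in strings)
```

For this example the subset construction produces a DFA with three states, which accepts exactly the same strings as the NFA on every input tested.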


13.3 Efficient exact learning 


In the efficient exact learning framework, the problem consists of identifying a 
target concept c from a finite set of examples in time polynomial in the size of the 
representation of the concept and in an upper bound on the size of the representation 
of an example. Unlike the PAC-learning framework, in this model, there is no 
stochastic assumption: instances are not assumed to be drawn according to some 
unknown distribution. Furthermore, the objective is to identify the target concept 
exactly, without any approximation. A concept class $C$ is said to be efficiently 
exactly learnable if there is an algorithm for efficient exact learning of any $c \in C$. 
We will consider two different scenarios within the framework of efficient exact 
learning: a passive and an active learning scenario. The passive learning scenario is 
similar to the standard supervised learning scenario discussed in previous chapters 
but without any stochastic assumption: the learning algorithm passively receives 
data instances as in the PAC model and returns a hypothesis, but here, instances 
are not assumed to be drawn from any distribution. In the active learning scenario, 
the learner actively participates in the selection of the training samples by using 
various types of queries that we will describe. In both cases, we will focus more 
specifically on the problem of learning automata. 


13.3.1 Passive learning 


The problem of learning finite automata in this scenario is known as the minimum 
consistent DFA learning problem. It can be formulated as follows: the learner
receives a finite sample S = ((x_1, y_1), ..., (x_m, y_m)) with x_i ∈ Σ* and y_i ∈ {−1, +1}
for any i ∈ [1, m]. If y_i = +1, then x_i is an accepted string, otherwise it is rejected.
The problem consists of using this sample to learn the smallest DFA A consistent
with S, that is the automaton with the smallest number of states that accepts the
strings of S with label +1 and rejects those with label −1. Note that seeking the
smallest DFA consistent with S can be viewed as following Occam’s razor principle. 

The problem just described is distinct from the standard minimization of DFAs. A 
minimal DFA accepting exactly the strings of S labeled positively may not have the 
smallest number of states: in general there may be DFAs with fewer states accepting 
a superset of these strings and rejecting the negatively labeled sample strings. 
For example, in the simple case S = ((a, +1), (b, −1)), a minimal deterministic
automaton accepting the unique positively labeled string a or the unique negatively 
labeled string b admits two states. However, the deterministic automaton accepting 
the language a* accepts a and rejects b and has only one state. 
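
The two-string example can be checked mechanically. The sketch below is only an illustration: the tuple encoding `(delta, initial, finals)`, the helper names, and the convention that a missing transition rejects (a partial DFA) are assumptions made here, not definitions from the text.

```python
def accepts(delta, initial, finals, x):
    """Run a (possibly partial) DFA on x; a missing transition rejects."""
    q = initial
    for symbol in x:
        if (q, symbol) not in delta:
            return False
        q = delta[(q, symbol)]
    return q in finals


def is_consistent(dfa, sample):
    """True iff the DFA accepts every +1 string and rejects every -1 string."""
    delta, initial, finals = dfa
    return all(accepts(delta, initial, finals, x) == (y == +1) for x, y in sample)


def num_states(dfa):
    """Count the states mentioned anywhere in the DFA description."""
    delta, initial, finals = dfa
    states = {initial} | set(finals)
    for (q, _symbol), r in delta.items():
        states.update((q, r))
    return len(states)


SAMPLE = [("a", +1), ("b", -1)]
EXACT_A = ({(0, "a"): 1}, 0, {1})   # accepts exactly the string a
A_STAR = ({(0, "a"): 0}, 0, {0})    # accepts the language a*
```

Both automata are consistent with the sample, but `A_STAR` needs one state where `EXACT_A` needs two, which is the gap between DFA minimization and the minimum consistent DFA problem.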

Passive learning of finite automata turns out to be a computationally hard 
problem. The following theorems present several negative results known for this 
problem. 


Theorem 13.1 
The problem of finding the smallest deterministic automaton consistent with a set 
of accepted or rejected strings is NP-complete. 


Hardness results are known even for a polynomial approximation, as stated by the 
following theorem. 


Theorem 13.2 

If P ≠ NP, then no polynomial-time algorithm can be guaranteed to find a DFA
consistent with a set of accepted or rejected strings of size smaller than a polynomial
function of the smallest consistent DFA, even when the alphabet is reduced to just
two elements.


Other strong negative results are known for passive learning of finite automata 
under various cryptographic assumptions. 

These negative results for passive learning invite us to consider alternative 
learning scenarios for finite automata. The next section describes a scenario leading 
to more positive results where the learner can actively participate in the data 
selection process using various types of queries. 


13.3.2 Learning with queries 


The model of learning with queries corresponds to that of a (minimal) teacher or 
oracle and an active learner. In this model, the learner can make the following two 
types of queries to which an oracle responds: 


• membership queries: the learner requests the target label f(x) ∈ {−1, +1} of an
instance x and receives that label;


• equivalence queries: the learner conjectures a hypothesis h and receives the response
yes if h = f, and a counter-example otherwise.


We will say that a concept class C is efficiently exactly learnable with membership 
and equivalence queries when it is efficiently exactly learnable within this model. 

This model is not realistic, since no such oracle is typically available in practice. 
Nevertheless, it provides a natural framework, which, as we shall see, leads to 
positive results. Note also that for this model to be significant, equivalence must be 
computationally testable. This would not be the case for some concept classes such 
as that of context-free grammars, for example, for which the equivalence problem is 
undecidable. In fact, equivalence must further be efficiently testable, otherwise the
response to the learner cannot be supplied in a reasonable amount of time.¹

Efficient exact learning within this model of learning with queries implies the 
following variant of PAC-learning: we will say that a concept class C is PAC- 
learnable with membership queries if it is PAC-learnable by an algorithm that has 
access to a polynomial number of membership queries. 


Theorem 13.3 
Let C be a concept class that is efficiently exactly learnable with membership and
equivalence queries. Then C is PAC-learnable using membership queries.


1. For a human oracle, answering membership queries may also become very hard in some 
cases when the queries are near the class boundaries. This may also make the model 
difficult to adopt in practice.

Proof Let A be an algorithm for efficiently exactly learning C using membership
and equivalence queries. Fix ε, δ > 0. In the execution of A for learning a
target c ∈ C, we replace each equivalence query by a test of the current hypothesis
on a polynomial number of labeled examples. Let D be the distribution according to
which points are drawn. To simulate the t-th equivalence query, we draw
m_t = (1/ε)(log(1/δ) + t log 2) points i.i.d. according to D to test the current
hypothesis h_t. If h_t is consistent with all of these points, then the algorithm stops
and returns h_t. Otherwise, one of the points drawn is misclassified by h_t, which
provides a counter-example.

Since A learns c exactly, it makes at most T equivalence queries, where T is
polynomial in the size of the representation of the target concept and in an upper
bound on the size of the representation of an example. Thus, if no equivalence
query is answered positively by the simulation, the algorithm will terminate after T
equivalence queries and return the correct concept c. Otherwise, the algorithm stops
at the first equivalence query answered positively by the simulation. The hypothesis
it returns is not an ε-approximation only if the equivalence query stopping the
algorithm is incorrectly answered positively. By the union bound, since for any
fixed t ∈ [1, T], Pr[R(h_t) > ε] ≤ (1 − ε)^{m_t}, the probability that for some t ∈ [1, T],
R(h_t) > ε can be bounded as follows:


$$
\Pr\big[\exists t \in [1, T] \colon R(h_t) > \epsilon\big]
\le \sum_{t=1}^{T} \Pr[R(h_t) > \epsilon]
\le \sum_{t=1}^{T} (1 - \epsilon)^{m_t}
\le \sum_{t=1}^{T} e^{-m_t \epsilon}
\le \sum_{t=1}^{T} \frac{\delta}{2^t}
\le \sum_{t=1}^{+\infty} \frac{\delta}{2^t} = \delta.
$$

Thus, with probability at least 1 − δ, the hypothesis returned by the algorithm is
an ε-approximation. Finally, the maximum number of points drawn is

$$
\sum_{t=1}^{T} m_t = \frac{1}{\epsilon}\Big(T \log\frac{1}{\delta} + \frac{T(T+1)}{2} \log 2\Big),
$$

which is polynomial in 1/ε, 1/δ, and T. Since the rest of the computational cost of A
is also polynomial by assumption, this proves the PAC-learning of C.
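
The simulation used in the proof can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the representation of hypotheses and of the target as Boolean-valued callables, the `draw` callable standing for sampling from D, and the rounding of m_t up to an integer are all assumptions of the sketch.

```python
import math
import random


def sample_size(t, eps, delta):
    """m_t = (1/eps) * (log(1/delta) + t * log 2), rounded up to an integer."""
    return math.ceil((math.log(1.0 / delta) + t * math.log(2.0)) / eps)


def simulated_equivalence_query(h, c, draw, t, eps, delta):
    """Simulate the t-th equivalence query with m_t labeled examples.

    h and c map an instance to a label; draw() returns one instance drawn
    from the underlying distribution. Returns None if h agrees with c on all
    m_t points (the query is answered positively), and otherwise the first
    point on which they disagree, which serves as a counter-example.
    """
    for _ in range(sample_size(t, eps, delta)):
        x = draw()
        if h(x) != c(x):
            return x
    return None


# Hypothetical usage: instances are integers, the target labels even numbers.
target = lambda x: x % 2 == 0
draw = lambda: random.randrange(100)
```

With this choice of m_t, a hypothesis with error larger than ε passes the t-th test with probability at most (1 − ε)^{m_t} ≤ δ/2^t, which is the term summed in the union bound above.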


13.3.3 Learning automata with queries 


In this section, we describe an algorithm for efficient exact learning of DFAs with 
membership and equivalence queries. We will denote by A the target DFA and by
Â the DFA that is the current hypothesis of the algorithm. For the discussion of
the algorithm, we assume without loss of generality that A is a minimal DFA.
The algorithm uses two sets of strings, U and V. U is a set of access strings:
reading an access string u ∈ U from the initial state of A leads to a state A(u). The
algorithm ensures that the states A(u), u ∈ U, are all distinct. To do so, it uses a
set V of distinguishing strings. Since A is minimal, for two distinct states q and q′
of A, there must exist at least one string that leads to a final state from q and not
from q′, or vice versa. That string helps distinguish q and q′. The strings of V
help distinguish any pair of access strings in U. In fact, they define a partition of
all strings of Σ*.

Figure 13.2 (a) Classification tree T, with U = {ε, b, ba} and V = {ε, a}. (b) Current
automaton Â constructed using T. (c) Target automaton A.

The objective of the algorithm is to find at each iteration a new access string 
distinguished from all previous ones, ultimately obtaining a number of access strings 
equal to the number of states of A. It can then identify each state A(u) of A with 
its access string u. To find the destination state of the transition labeled with a ∈ Σ
leaving state u, it suffices to determine, using the partition induced by V, the access
string u′ that belongs to the same equivalence class as ua. The finality of each state
can be determined in a similar way.

Both sets U and V are maintained by the algorithm via a binary decision tree T
similar to those presented in chapter 8. Figure 13.2a shows an example. T defines
the partition of all strings induced by the distinguishing strings V. The leaves of T
are each labeled with a distinct u ∈ U and its internal nodes with a string v ∈ V.
The decision tree question defined by v ∈ V, given a string x ∈ Σ*, is whether xv
is accepted by A, which is determined via a membership query. If accepted, x is
assigned to the right sub-tree, otherwise to the left sub-tree, and the same is applied
recursively with the sub-trees until a leaf is reached. We denote by T(x) the label of
the leaf reached. For example, for the tree T of figure 13.2a and target automaton A
of figure 13.2c, T(baa) = b since baa is not accepted by A (root question) and baaa
is (question at node a). At its initialization step, the algorithm ensures that the
root node is labeled with ε, which is convenient to check the finality of the strings.
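
The computation of T(x) can be sketched as follows. The `Node` class, the function name `sift`, and the example target (strings over {a, b} whose number of a's is a multiple of 3, which is not the automaton of figure 13.2) are assumptions chosen for illustration; the membership oracle is modeled as a plain Python callable.

```python
class Node:
    """Internal nodes carry a distinguishing string, leaves an access string."""

    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

    def is_leaf(self):
        return self.left is None and self.right is None


def sift(root, x, member):
    """Return T(x): the label of the leaf reached by x.

    At an internal node labeled v, one membership query on xv decides the
    direction: right sub-tree if xv is accepted, left sub-tree otherwise.
    """
    node = root
    while not node.is_leaf():
        node = node.right if member(x + node.label) else node.left
    return node.label


# Hypothetical target: the number of a's is a multiple of 3.
member = lambda s: s.count("a") % 3 == 0

# Tree with access strings U = {"", "a", "aa"} and distinguishing strings V = {"", "a"}.
TREE = Node("", left=Node("a", left=Node("a"), right=Node("aa")), right=Node(""))
```

For instance, sifting `"baab"` asks whether `"baab"` is accepted (no, go left) and then whether `"baaba"` is accepted (yes, go right), reaching the leaf labeled `"aa"`.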

The tentative hypothesis DFA Â can be constructed from T as follows. We denote
by CONSTRUCTAUTOMATON() the corresponding function. A distinct state Â(u) is
created for each leaf u ∈ U. The finality of a state Â(u) is determined based on
the sub-tree of the root node that u belongs to: Â(u) is made final iff u belongs
to the right sub-tree, that is, iff u = εu is accepted by A. The destination of the
transition labeled with a ∈ Σ leaving state Â(u) is the state Â(v) where v = T(ua).
Figure 13.2b shows the DFA Â constructed from the decision tree of figure 13.2a.
For convenience, for any x ∈ Σ*, we denote by U(Â(x)) the access string identifying
state Â(x).


QUERYLEARNAUTOMATA()
 1  t ← MEMBERSHIPQUERY(ε)
 2  T ← T_0
 3  Â ← A_0
 4  while (EQUIVALENCEQUERY(Â) ≠ TRUE) do
 5      x ← COUNTEREXAMPLE()
 6      if (T = T_0) then
 7          T ← T_1    ▷ NIL replaced with x.
 8      else j ← argmin_k A(x[k]) ≢_T Â(x[k])
 9          SPLIT(Â(x[j − 1]))
10      Â ← CONSTRUCTAUTOMATON(T)
11  return Â


Figure 13.3 Algorithm for learning automata with membership and equivalence
queries. A_0 is a single-state automaton with self-loops labeled with all a ∈ Σ. That
state is initial. It is final iff t = TRUE. T_0 is a tree with root node labeled with ε and
two leaves, one labeled with ε, the other with NIL. The right leaf is labeled with ε
iff t = TRUE. T_1 is the tree obtained from T_0 by replacing NIL with x.

Figure 13.3 shows the pseudocode of the algorithm. The initialization steps at
lines 1-3 construct a tree T with a single internal node labeled with ε and one leaf
labeled with ε, the other left undetermined and labeled with NIL. They also
define a tentative DFA Â with a single state with self-loops labeled with all elements
of the alphabet. That single state is an initial state. It is made a final state if and only if
ε is accepted by the target DFA A, which is determined via the membership query
of line 1.

At each iteration of the loop of lines 4-11, an equivalence query is used. If Â is not
equivalent to A, then a counter-example string x is received (line 5). If T is the tree
constructed in the initialization step, then the leaf labeled with NIL is replaced with
x (lines 6-7). Otherwise, since x is a counter-example, states A(x) and Â(x) have a
different finality; thus, the string x defining A(x) and the access string U(Â(x)) are
assigned to different equivalence classes by T. Thus, there exists a smallest j such
that A(x[j]) and Â(x[j]) are not equivalent, that is, such that the prefix x[j] of x
and the access string U(Â(x[j])) are assigned to different leaves by T. j cannot be 0
since the initialization ensures that Â(ε) is an initial state and has the same finality
as the initial state A(ε) of A. The equivalence of A(x[j]) and Â(x[j]) is tested by
checking the equality of T(x[j]) and T(U(Â(x[j]))), which can both be determined
using the tree T and membership queries (line 8).

Now, by definition, A(x[j − 1]) and Â(x[j − 1]) are equivalent, that is T assigns
x[j − 1] to the leaf labeled with U(Â(x[j − 1])). But x[j − 1] and U(Â(x[j − 1])) must
be distinguished since A(x[j − 1]) and Â(x[j − 1]) admit transitions labeled with
the same label x_j to two non-equivalent states. Let v be a distinguishing string for
A(x[j]) and Â(x[j]). v can be obtained as the least common ancestor of the leaves
labeled with T(x[j]) and U(Â(x[j])). To distinguish x[j − 1] and U(Â(x[j − 1])), it
suffices to split the leaf of T labeled with T(x[j − 1]) to create an internal node x_j v
dominating a leaf labeled with x[j − 1] and another one labeled with T(x[j − 1])
(line 9). Figure 13.4 illustrates this construction. Thus, this provides a new access
string x[j − 1] which, by construction, is distinguished from U(Â(x[j − 1])) and all
other access strings.

Figure 13.4 Illustration of the splitting procedure SPLIT(Â(x[j − 1])).

Thus, the number of access strings (or states of Â) increases by one at each
iteration of the loop. When it reaches the number of states of A, all states of A
are of the form A(u) for a distinct u ∈ U. A and Â have then the same number
of states and in fact A = Â. Indeed, let (A(u), a, A(u′)) be a transition in A, then
by definition the equality A(ua) = A(u′) holds. The tree T defines a partition
of all strings in terms of their distinguishing strings in A. Since in A, ua and u′
lead to the same state, they are assigned to the same leaf by T, that is, the leaf
labeled with u′. The destination of the transition from Â(u) with label a is found
by CONSTRUCTAUTOMATON() by determining the leaf in T assigned to ua, that
is, u′. Thus, by construction, the same transition (Â(u), a, Â(u′)) is created in Â.
Also, a state A(u) of A is final iff u is accepted by A, that is iff u is assigned to the
right sub-tree of the root node by T, which is the criterion determining the finality
of Â(u). Thus, the automata A and Â coincide.
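
The whole procedure fits in a short Python sketch. Everything below is an illustrative assumption rather than the book's implementation: hypothesis states are represented directly by their access strings (so the initial state is the empty string), the NIL leaf of T_0 is avoided by building T_1 when the first counter-example arrives, and the two oracles are simulated from a known complete target DFA, with the equivalence oracle returning a shortest counter-example found by breadth-first search.

```python
from collections import deque


class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right


def sift_path(root, x, member):
    """Nodes visited when sifting x; the last one is the leaf T(x)."""
    path = [root]
    while path[-1].left is not None:
        node = path[-1]
        path.append(node.right if member(x + node.label) else node.left)
    return path


def leaf_labels(node):
    if node.left is None:
        return [node.label]
    return leaf_labels(node.left) + leaf_labels(node.right)


def construct_automaton(root, alphabet, member):
    """Hypothesis DFA: states are access strings, transitions are T(ua)."""
    access = leaf_labels(root)
    delta = {(u, a): sift_path(root, u + a, member)[-1].label
             for u in access for a in alphabet}
    # u is final iff u is accepted, i.e. iff u lies in the right sub-tree of the root.
    return delta, {u for u in access if member(u)}


def run(delta, x):
    q = ""
    for a in x:
        q = delta[(q, a)]
    return q


def split(root, delta, x, j, member):
    """Split the leaf T(x[:j-1]) with the new distinguishing string x_j v."""
    p1 = sift_path(root, x[:j], member)
    p2 = sift_path(root, run(delta, x[:j]), member)
    v = [n for n, m in zip(p1, p2) if n is m][-1].label  # least common ancestor
    leaf = sift_path(root, x[:j - 1], member)[-1]
    old, new = Node(leaf.label), Node(x[:j - 1])
    leaf.label = x[j - 1] + v
    leaf.left, leaf.right = (old, new) if member(x[:j] + v) else (new, old)


def query_learn_automata(alphabet, member, equivalence):
    t = member("")
    delta, finals = {("", a): "" for a in alphabet}, ({""} if t else set())
    root = None
    while True:
        x = equivalence(delta, finals)  # None, or a counter-example string
        if x is None:
            return delta, finals
        if root is None:  # first counter-example: build T_1 directly
            root = Node("", Node(x), Node("")) if t else Node("", Node(""), Node(x))
        else:
            j = next(k for k in range(1, len(x) + 1)
                     if sift_path(root, x[:k], member)[-1].label != run(delta, x[:k]))
            split(root, delta, x, j, member)
        delta, finals = construct_automaton(root, alphabet, member)


def make_oracles(tdelta, tinit, tfinals, alphabet):
    """Simulated teacher for a complete target DFA (assumption of this sketch)."""
    def member(x):
        q = tinit
        for a in x:
            q = tdelta[(q, a)]
        return q in tfinals

    def equivalence(delta, finals):
        seen, queue = {(tinit, "")}, deque([(tinit, "", "")])
        while queue:
            p, q, w = queue.popleft()
            if (p in tfinals) != (q in finals):
                return w
            for a in alphabet:
                nxt = (tdelta[(p, a)], delta[(q, a)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt[0], nxt[1], w + a))
        return None

    return member, equivalence


# Hypothetical target: strings over {a, b} whose number of a's is a multiple of 3.
ALPHABET = "ab"
TARGET = {(i, "a"): (i + 1) % 3 for i in range(3)}
TARGET.update({(i, "b"): i for i in range(3)})
member, equivalence = make_oracles(TARGET, 0, {0}, ALPHABET)
delta, finals = query_learn_automata(ALPHABET, member, equivalence)
```

On this target the sketch receives the counter-examples `a` and then `aaa`, and stops with three access strings, one per state of the target. Rebuilding the hypothesis from scratch at every iteration keeps the code short at the cost of repeated membership queries.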




Figure 13.5 Illustration of the execution of algorithm QUERYLEARNAUTOMATA()
for the target automaton A. Each line shows the current decision tree T and the
tentative DFA Â constructed using T. When Â is not equivalent to A, the learner
receives a counter-example x indicated in the third column.


The following is the analysis of the running-time complexity of the algorithm. At 
each iteration, one new distinguished access string is found associated to a distinct 
state of A, thus, at most |A| states are created. For each counter-example x, at most 
|x| tree operations are performed. Constructing Â requires O(|Σ||A|) tree operations.
The cost of a tree operation is O(|A|) since it consists of at most |A| membership
queries. Thus, the overall complexity of the algorithm is in O(|Σ||A|² + n|A|), where
n is the maximum length of a counter-example. Note that this analysis assumes 
that equivalence and membership queries are made in constant time. 

Our analysis shows the following result.


Theorem 13.4 Learning DFAs with queries
The class of all DFAs is efficiently exactly learnable using membership and
equivalence queries.


Figure 13.5 illustrates a full execution of the algorithm in a specific case. 
In the next section, we examine a different learning scenario for automata. 


13.4 Identification in the limit 


In the identification in the limit framework, the problem consists of identifying a 
target concept c exactly after receiving a finite set of examples. A class of languages 
is said to be identifiable in the limit if there exists an algorithm that identifies 
any language L in that class after examining a finite number of examples and its 
hypothesis remains unchanged thereafter. 

This framework is perhaps less realistic from a computational point of view since 
it requires no upper bound on the number of instances or the efficiency of the 
algorithm. Nevertheless, it has been argued by some to be similar to the scenario 
of humans learning languages. In this framework as well, negative results hold for 
the general problem of learning DFAs. 


Theorem 13.5 
Deterministic automata are not identifiable in the limit from positive examples. 


Some sub-classes of finite automata can however be successfully identified in the 
limit. Most algorithms for inference of automata are based on a state-partitioning 
paradigm. They start with an initial DFA, typically a tree accepting the finite set 
of sample strings available and the trivial partition: each block is reduced to one 
state of the tree. At each iteration, they merge partition blocks while preserving 
some congruence property. The iteration ends when no other merging is possible. 
The final partition defines the inferred automaton as the corresponding quotient
automaton. Thus, the choice of
the congruence fully determines the algorithm and a variety of different algorithms
can be defined by varying that choice. A state-splitting paradigm can be similarly
defined starting from the single-state automaton accepting Σ*. In this section, we
present an algorithm for learning reversible automata, which is a special instance 
of the general state-partitioning algorithmic paradigm just described. 

Let A = (Σ, Q, I, F, E) be a DFA and let π be a partition of Q. The DFA defined
by the partition π is called the quotient automaton of A and π. It is denoted by
A/π and defined as follows: A/π = (Σ, π, I_π, F_π, E_π) with

$$
\begin{aligned}
I_\pi &= \{B \in \pi \colon I \cap B \neq \emptyset\} \\
F_\pi &= \{B \in \pi \colon F \cap B \neq \emptyset\} \\
E_\pi &= \{(B, a, B') \colon \exists (q, a, q') \in E \mid q \in B,\ q' \in B',\ B \in \pi,\ B' \in \pi\}.
\end{aligned}
$$
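
The quotient construction translates directly into set comprehensions. In the sketch below, the function name `quotient`, the encoding of transitions as triples, and the use of `frozenset` objects as blocks are assumptions made for illustration.

```python
def quotient(initials, finals, edges, partition):
    """Return (blocks, I_pi, F_pi, E_pi) for the quotient automaton A/pi.

    edges is a set of transitions (q, a, q2); partition is an iterable of
    disjoint sets of states covering all states.
    """
    blocks = [frozenset(B) for B in partition]
    block_of = {q: B for B in blocks for q in B}
    # A block is initial (final) iff it contains an initial (final) state.
    I_pi = {block_of[q] for q in initials}
    F_pi = {block_of[q] for q in finals}
    # Each transition of A induces a transition between the blocks of its endpoints.
    E_pi = {(block_of[q], a, block_of[r]) for (q, a, r) in edges}
    return set(blocks), I_pi, F_pi, E_pi


# Hypothetical example: a three-state chain whose end states are merged.
EDGES = {(0, "a", 1), (1, "a", 2)}
BLOCKS, I_PI, F_PI, E_PI = quotient({0}, {2}, EDGES, [{0, 2}, {1}])
```

Merging the end states of the chain accepting `aa` yields a two-state cycle, which accepts all strings with an even number of a's: a quotient always accepts a superset of the original language.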


Let S be a finite set of strings and let Pref(S) denote the set of prefixes of all strings
of S. A prefix-tree automaton accepting exactly the set of strings S is a particular
DFA denoted by PT(S) = (Σ, Pref(S), {ε}, S, E_S) where Σ is the set of alphabet
symbols used in S and E_S is defined as follows:

$$
E_S = \{(x, a, xa) \colon x \in \mathrm{Pref}(S),\ xa \in \mathrm{Pref}(S)\}.
$$


Figure 13.7a shows the prefix-tree automaton of a particular set of strings S. 
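
A prefix-tree automaton is simple to build when each state is identified with the prefix that reaches it. The function name `prefix_tree` and the tuple returned are assumptions of this sketch; the sample is the one used in figure 13.7, with the empty string written as `""`.

```python
def prefix_tree(S):
    """Return (states, initial, finals, edges) for PT(S).

    States are the prefixes of the strings of S, the initial state is the
    empty string, the final states are the strings of S themselves, and each
    non-empty prefix xa receives the single incoming transition (x, a, xa).
    """
    states = {x[:k] for x in S for k in range(len(x) + 1)}
    edges = {(p[:-1], p[-1], p) for p in states if p}
    return states, "", set(S), edges


SAMPLE = ["", "aa", "bb", "aaaa", "abab", "abba", "baba"]
STATES, INITIAL, FINALS, EDGES = prefix_tree(SAMPLE)
```

For this sample the tree has 15 states and 14 transitions, one per non-empty prefix.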


13.4.1 Learning reversible automata


In this section, we show that the sub-class of reversible automata or reversible 
languages can be identified in the limit. 

Given a DFA A, we define its reverse A^R as the automaton derived from A by
making the initial state final, the final states initial, and by reversing the direction of 
every transition. The language accepted by the reverse of A is precisely the language 
of the reverse (or mirror image) of the strings accepted by A. 


Definition 13.2 Reversible automata 

A finite automaton A is said to be reversible iff both A and A^R are deterministic. A
language L is said to be reversible if it is the language accepted by some reversible 
automaton. 


Some direct consequences of this definition are that a reversible automaton A has
a unique final state and that its reverse A^R is also reversible. Note also that a trim
reversible automaton A is minimal. Indeed, if states q and q′ in A are equivalent,
then they admit a common string x leading both from q and from q′ to a final
state. But, by the reverse determinism of A, reading the reverse of x from the final
state must lead to a unique state, which implies that q = q′.
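
The definition can be tested directly on small automata. The sketch below assumes transitions are given as a set of triples and checks determinism as "at most one initial state and at most one transition per state and symbol"; the function names and the three example automata are illustrative assumptions.

```python
def reverse(initials, finals, edges):
    """Swap initial and final states and reverse every transition."""
    return set(finals), set(initials), {(r, a, q) for (q, a, r) in edges}


def is_deterministic(initials, edges):
    """At most one initial state and one outgoing transition per (state, symbol)."""
    if len(initials) > 1:
        return False
    sources = [(q, a) for (q, a, _r) in set(edges)]
    return len(sources) == len(set(sources))


def is_reversible(initials, finals, edges):
    r_initials, _r_finals, r_edges = reverse(initials, finals, edges)
    return is_deterministic(initials, edges) and is_deterministic(r_initials, r_edges)


# Hypothetical examples.
EVEN_A = ({0}, {0}, {(0, "a", 1), (1, "a", 0), (0, "b", 0), (1, "b", 1)})
TWO_WAYS_IN = ({0}, {2}, {(0, "a", 2), (0, "b", 1), (1, "a", 2)})
TWO_FINALS = ({0}, {1, 2}, {(0, "a", 1), (0, "b", 2)})
```

`TWO_WAYS_IN` fails because two a-transitions enter the final state, and `TWO_FINALS` fails because its reverse would have two initial states, in line with the remark that a reversible automaton has a unique final state.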

For any u ∈ Σ* and any language L ⊆ Σ*, let Suff_L(u) denote the set of all
possible suffixes in L for u:

$$
\mathrm{Suff}_L(u) = \{v \in \Sigma^* \colon uv \in L\}. \tag{13.1}
$$

Suff_L(u) is also often denoted by u⁻¹L. Observe that if L is a reversible language,
then the following implication holds for any two strings u, u′ ∈ Σ*:

$$
\mathrm{Suff}_L(u) \cap \mathrm{Suff}_L(u') \neq \emptyset \implies \mathrm{Suff}_L(u) = \mathrm{Suff}_L(u'). \tag{13.2}
$$

Indeed, let A be a reversible automaton accepting L. Let q be the state of A
reached from the initial state when reading u and q′ the one reached reading u′. If
v ∈ Suff_L(u) ∩ Suff_L(u′), then v can be read both from q and q′ to reach the final
state. Since A^R is deterministic, reading back the reverse of v from the final state
must lead to a unique state, therefore q = q′, that is Suff_L(u) = Suff_L(u′).

Let A = (Σ, Q, {i_0}, {f_0}, E) be a reversible automaton accepting a reversible
language L. We define a set of strings S_L as follows:

$$
S_L = \{d[q] f[q] \colon q \in Q\} \cup \{d[q]\, a\, f[q'] \colon q, q' \in Q,\ a \in \Sigma,\ (q, a, q') \in E\},
$$

where d[q] is a string of minimum length from i_0 to q, and f[q] a string of minimum
length from q to f_0. As shown by the following proposition, S_L characterizes the
language L in the sense that any reversible language containing S_L must contain L.
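
The sample S_L can be computed with two breadth-first searches. The sketch below is illustrative: it assumes a trim reversible automaton given as a set of transition triples, it builds the second set only from actual transitions (q, a, q′) ∈ E, which is how that set is used in the proof that follows, and the function names are assumptions.

```python
from collections import deque


def shortest_strings(start, edges):
    """Map each state reachable from start to a minimum-length string reaching it."""
    best = {start: ""}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for (p, a, r) in sorted(edges):
            if p == q and r not in best:
                best[r] = best[q] + a
                queue.append(r)
    return best


def characteristic_sample(states, i0, f0, edges):
    d = shortest_strings(i0, edges)
    # f[q] is found by searching the reversed automaton from f0 and
    # reversing the string read along the way.
    rev = {(r, a, q) for (q, a, r) in edges}
    f = {q: w[::-1] for q, w in shortest_strings(f0, rev).items()}
    sample = {d[q] + f[q] for q in states}
    sample |= {d[q] + a + f[r] for (q, a, r) in edges}
    return sample


# Hypothetical example: two-state automaton for an even number of a's.
EVEN_A_EDGES = {(0, "a", 1), (1, "a", 0), (0, "b", 0), (1, "b", 1)}
S_L = characteristic_sample({0, 1}, 0, 0, EVEN_A_EDGES)
```

For this automaton d[0] = f[0] = ε and d[1] = f[1] = a, so the sample consists of ε, aa, b, and aba, all of which have an even number of a's.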


Proposition 13.1 
Let L be a reversible language. Then, L is the smallest reversible language containing
S_L.


Proof Let L′ be a reversible language containing S_L and let x = x_1 ⋯ x_n be
a string accepted by L, with x_k ∈ Σ for k ∈ [1, n] and n ≥ 1. For convenience,
we also define x_0 as ε. Let (q_0, x_1, q_1) ⋯ (q_{n−1}, x_n, q_n) be the accepting path in
A labeled with x. We show by recurrence that Suff_{L′}(x_0 ⋯ x_k) = Suff_{L′}(d[q_k])
for all k ∈ [0, n]. Since d[q_0] = d[i_0] = ε, this clearly holds for k = 0. Now
assume that Suff_{L′}(x_0 ⋯ x_k) = Suff_{L′}(d[q_k]) for some k ∈ [0, n − 1]. This
implies immediately that Suff_{L′}(x_0 ⋯ x_k x_{k+1}) = Suff_{L′}(d[q_k] x_{k+1}). By definition,
S_L contains both d[q_{k+1}] f[q_{k+1}] and d[q_k] x_{k+1} f[q_{k+1}]. Since L′ includes S_L, the
same holds for L′. Thus, f[q_{k+1}] belongs to Suff_{L′}(d[q_{k+1}]) ∩ Suff_{L′}(d[q_k] x_{k+1}).
In view of (13.2), this implies that Suff_{L′}(d[q_k] x_{k+1}) = Suff_{L′}(d[q_{k+1}]). Thus, we
have Suff_{L′}(x_0 ⋯ x_k x_{k+1}) = Suff_{L′}(d[q_{k+1}]). This shows that Suff_{L′}(x_0 ⋯ x_k) =
Suff_{L′}(d[q_k]) holds for all k ∈ [0, n], in particular, for k = n. Note that since
q_n = f_0, we have f[q_n] = ε, therefore d[q_n] = d[q_n] f[q_n] is in S_L ⊆ L′, which
implies that Suff_{L′}(d[q_n]) contains ε and thus that Suff_{L′}(x_0 ⋯ x_n) contains ε. This
is equivalent to x = x_0 ⋯ x_n ∈ L′. ∎


Figure 13.6 shows the pseudocode of an algorithm for inferring a reversible
automaton from a sample S of m strings x_1, ..., x_m. The algorithm starts by
creating a prefix-tree automaton A for S (line 1) and then iteratively defines a
partition π of the states of A, starting with the trivial partition π_0 with one block
per state (line 2). The automaton returned is the quotient of A and the final partition
π defined.


LEARNREVERSIBLEAUTOMATA(S = (x_1, ..., x_m))
 1  A = (Σ, Q, {i_0}, F, E) ← PT(S)
 2  π ← π_0    ▷ trivial partition.
 3  LIST ← {(f, f′): f′ ∈ F}    ▷ f arbitrarily chosen in F.
 4  while LIST ≠ ∅ do
 5      REMOVE(LIST, (q_1, q_2))
 6      if B(q_1, π) ≠ B(q_2, π) then
 7          B_1 ← B(q_1, π)
 8          B_2 ← B(q_2, π)
 9          for all a ∈ Σ do
10              if (succ(B_1, a) ≠ ∅) ∧ (succ(B_2, a) ≠ ∅) then
11                  ADD(LIST, (succ(B_1, a), succ(B_2, a)))
12              if (pred(B_1, a) ≠ ∅) ∧ (pred(B_2, a) ≠ ∅) then
13                  ADD(LIST, (pred(B_1, a), pred(B_2, a)))
14          UPDATE(succ, pred, B_1, B_2)
15          π ← MERGE(π, B_1, B_2)
16  return A/π


Figure 13.6 Algorithm for learning reversible automata from a set of positive
strings S.

The algorithm maintains a list LIST of pairs of states whose corresponding blocks
are to be merged, starting with all pairs of final states (f, f′) for an arbitrarily
chosen final state f ∈ F (line 3). We denote by B(q, π) the block containing q based
on the partition π.

For each block B and alphabet symbol a ∈ Σ, the algorithm also maintains a
successor succ(B, a), that is, a state that can be reached by reading a from a state
of B; succ(B, a) = ∅ if no such state exists. It maintains similarly the predecessor
pred(B, a), which is a state that admits a transition labeled with a leading to a
state in B; pred(B, a) = ∅ if no such state exists.

Figure 13.7 Example of inference of a reversible automaton. (a) Prefix-tree PT(S)
representing S = (ε, aa, bb, aaaa, abab, abba, baba). (b) Automaton A returned by
LEARNREVERSIBLEAUTOMATA() for the input S. A double-direction arrow represents
two transitions with the same label with opposite directions. The language accepted
by A is that of strings with an even number of a's and an even number of b's.


Then, while LIST is not empty, a pair is removed from LIST and processed as
follows. If the blocks of the pair (q_1, q_2) have not already been merged, the pairs formed
by the successors and predecessors of B_1 = B(q_1, π) and B_2 = B(q_2, π) are added to LIST
(lines 10-13). Before merging blocks B_1 and B_2 into a new block B′ that defines
a new partition π (line 15), the successor and predecessor values for the new block
B′ are defined as follows (line 14). For each symbol a ∈ Σ, succ(B′, a) = ∅ if
succ(B_1, a) = succ(B_2, a) = ∅; otherwise succ(B′, a) is set to succ(B_1, a) if it
is non-empty, and to succ(B_2, a) otherwise. The predecessor values are defined in a similar
way. Figure 13.7 illustrates the application of the algorithm in the case of a sample
with m = 7 strings.
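
The merging procedure can be sketched in Python with a simple disjoint-set structure. This is an illustration under stated assumptions, not the book's implementation: states of the prefix tree are identified with prefixes, each block is represented by one of its states, the sample is assumed non-empty, only path halving is used (no union by rank), and the names `learn_reversible_automaton` and `accepts` are hypothetical.

```python
def learn_reversible_automaton(S):
    """Return (initial, finals, edges) of the quotient PT(S)/pi for a non-empty S."""
    prefixes = {x[:k] for x in S for k in range(len(x) + 1)}
    sigma = sorted({a for x in S for a in x})
    parent = {p: p for p in prefixes}  # disjoint-set forest over the states

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q

    # succ[(B, a)] / pred[(B, a)]: one representative successor / predecessor
    # state of block B for symbol a; a missing key plays the role of the empty set.
    succ, pred = {}, {}
    for p in prefixes:
        if p:
            succ[(p[:-1], p[-1])] = p
            pred[(p, p[-1])] = p[:-1]

    finals = sorted(set(S))
    todo = [(finals[0], f) for f in finals[1:]]  # all final states must be merged
    while todo:
        q1, q2 = todo.pop()
        b1, b2 = find(q1), find(q2)
        if b1 == b2:
            continue
        for a in sigma:
            for table in (succ, pred):
                s1, s2 = table.get((b1, a)), table.get((b2, a))
                if s1 is not None and s2 is not None:
                    todo.append((s1, s2))   # both defined: they must be merged too
                elif s2 is not None:
                    table[(b1, a)] = s2     # keep the defined value for the new block
        parent[b2] = b1                      # merge block b2 into block b1

    edges = {(find(p[:-1]), p[-1], find(p)) for p in prefixes if p}
    return find(""), {find(f) for f in finals}, edges


def accepts(automaton, x):
    initial, finals, edges = automaton
    delta = {(q, a): r for (q, a, r) in edges}
    q = initial
    for a in x:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals


SAMPLE = ["", "aa", "bb", "aaaa", "abab", "abba", "baba"]
AUTOMATON = learn_reversible_automaton(SAMPLE)
```

On the sample of figure 13.7, the 15 states of the prefix tree collapse into four blocks, one per combination of parities of the numbers of a's and b's.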


Proposition 13.2 

Let S be a finite set of strings and let A = PT(S) be the prefix-tree automaton
defined from S. Then, the final partition defined by LEARNREVERSIBLEAUTOMATA()
used with input S is the finest partition π for which A/π is reversible.


Proof Let T be the number of iterations of the algorithm for the input sample S.
We denote by π_t the partition defined by the algorithm after t ≥ 1 iterations of the
loop, with π_T the final partition.

A/π_T is a reversible automaton since all final states are guaranteed to be merged
into the same block as a consequence of the initialization step of line 3 and, for
any block B, by definition of the algorithm, states reachable by a ∈ Σ from B are
contained in the same block, and similarly for those admitting a transition labeled
with a to a state of B.

Let π′ be a partition of the states of A for which A/π′ is reversible. We show
by recurrence that π_T refines π′. Clearly, the trivial partition π_0 refines π′. Assume
that π_s refines π′ for all s ≤ t. π_{t+1} is obtained from π_t by merging two blocks
B(q_1, π_t) and B(q_2, π_t). Since π_t refines π′, we must have B(q_1, π_t) ⊆ B(q_1, π′)
and B(q_2, π_t) ⊆ B(q_2, π′). To show that π_{t+1} refines π′, it suffices to prove that
B(q_1, π′) = B(q_2, π′).

A reversible automaton has only one final state, therefore, for the partition π′,
all final states of A must be placed in the same block. Thus, if the pair (q_1, q_2)
processed at the (t + 1)th iteration is a pair of final states placed in LIST at the
initialization step (line 3), then we must have B(q_1, π′) = B(q_2, π′). Otherwise,
(q_1, q_2) was placed in LIST as a pair of successor or predecessor states of two states
q′_1 and q′_2 merged at a previous iteration s ≤ t. Since π_s refines π′, q′_1 and q′_2 are in
the same block of π′ and since A/π′ is reversible, q_1 and q_2 must also be in the same
block as successors or predecessors of the same block for the same label a ∈ Σ, thus
B(q_1, π′) = B(q_2, π′). ∎


Theorem 13.6 

Let S be a finite set of strings and let A be the automaton returned by
LEARNREVERSIBLEAUTOMATA() when used with input S. Then, L(A) is the small- 
est reversible language containing S. 


Proof Let L be a reversible language containing S, and let A′ be a reversible
automaton with L(A′) = L. Since every string of S is accepted by A′, any
u ∈ Pref(S) can be read from the initial state of A′ to reach some state q(u)
of A′. Consider the automaton A″ derived from A′ by keeping only states of the
form q(u) and transitions between such states. A″ has the unique final state of A′
since q(u) is final for u ∈ S, and it has the initial state of A′, since ε is a prefix
of strings of S. Furthermore, A″ directly inherits from A′ the property of being
deterministic and reverse deterministic. Thus, A″ is reversible.

The states of A″ define a partition of Pref(S): u, v ∈ Pref(S) are in the same
block iff q(u) = q(v). Since by definition of the prefix-tree PT(S), its states
can be identified with Pref(S), the states of A″ also define a partition π′ of the
states of PT(S) and thus A″ = PT(S)/π′. By proposition 13.2, the partition
π defined by algorithm LEARNREVERSIBLEAUTOMATA() run with input S is the
finest such that PT(S)/π is reversible. Therefore, we must have L(PT(S)/π) ⊆
L(PT(S)/π′) = L(A″). Since A″ is a sub-automaton of A′, L contains L(A″) and
therefore also L(PT(S)/π) = L(A), which concludes the proof. ∎


For the following theorem, a positive presentation of a language L is an infinite
sequence (x_n)_{n∈ℕ} such that {x_n: n ∈ ℕ} = L. Thus, in particular, for any x ∈ L
there exists n ∈ ℕ such that x = x_n. An algorithm identifies L in the limit from a
positive presentation if there exists N ∈ ℕ such that for n ≥ N the hypothesis it
returns is L.


Theorem 13.7 Identification in the limit of reversible languages 
Let L be a reversible language, then algorithm LEARNREVERSIBLEAUTOMATA()
identifies L in the limit from a positive presentation.


Proof Let L be a reversible language. By proposition 13.1, L admits a finite
characteristic sample S_L. Let (x_n)_{n∈ℕ} be a positive presentation of L and let
X_n denote the union of the first n elements of the sequence. Since S_L is finite,
there exists N ≥ 1 such that S_L ⊆ X_N. By theorem 13.6, for any n ≥ N,
LEARNREVERSIBLEAUTOMATA() run on the finite sample X_n returns the smallest
reversible language L′ containing X_n, a fortiori S_L, which, by definition of S_L, implies
that L′ = L. ∎


The main operations needed for the implementation of the algorithm for learning 
reversible automata are the standard FIND and UNION to determine the block a 
state belongs to and to merge two blocks into a single one. Using a disjoint-set 
data structure for these operations, the time complexity of the algorithm can be 
shown to be in O(n α(n)), where n denotes the sum of the lengths of all strings
in the input sample S and α(n) the inverse of the Ackermann function, which is
essentially constant (α(n) ≤ 4 for n ≤ 10^80).


13.5 Chapter notes 


For an overview of finite automata and some related recent results, see Hopcroft 
and Ullman [1979] or the more recent Handbook chapter by Perrin [1990], as well 
as the series of books by M. Lothaire [Lothaire, 1982, 1990, 2005]. 

Theorem 13.1, stating that the problem of finding a minimum consistent DFA is 
NP-hard, is due to Gold [1978]. This result was later extended by Angluin [1978].
Pitt and Warmuth [1993] further strengthened these results by showing that even an 
approximation within a polynomial function of the size of the smallest automaton 
is NP-hard (theorem 13.2). Their hardness results apply also to the case where 
prediction is made using NFAs. Kearns and Valiant [1994] presented hardness results 
of a different nature relying on cryptographic assumptions. Their results imply that
no polynomial-time algorithm can learn consistent NFAs polynomial in the size of
the smallest DFA from a finite sample of accepted and rejected strings if any of 
the generally accepted cryptographic assumptions holds: if factoring Blum integers 
is hard; or if the RSA public key cryptosystem is secure; or if deciding quadratic 
residuosity is hard. 

On the positive side, Trakhtenbrot and Barzdin [1973] showed that the smallest 
finite automaton consistent with the input data can be learned exactly from a 
uniform complete sample, whose size is exponential in the size of the automaton. 
The worst-case complexity of their algorithm is exponential, but a better average- 
case complexity can be obtained assuming that the topology and the labeling are 
selected randomly [Trakhtenbrot and Barzdin, 1973] or even that the topology is 
selected adversarially [Freund et al., 1993]. 

Cortes, Kontorovich, and Mohri [2007a] study an approach to the problem of 
learning automata based on linear separation in some appropriate high-dimensional 
feature space; see also Kontorovich et al. [2006, 2008]. The mapping of strings to 
that feature space can be defined implicitly using the rational kernels presented in 
chapter 5, which are themselves defined via weighted automata and transducers. 

The model of learning with queries was introduced by Angluin [1978], who also 
proved that finite automata can be learned in time polynomial in the size of 
the minimal automaton and that of the longest counter-example. Bergadano and 
Varricchio [1995] further extended this result to the problem of learning weighted 
automata defined over any field. Using the relationship between the size of a minimal 
weighted automaton over a field and the rank of the corresponding Hankel matrix, 
the learnability of many other concept classes such as disjoint DNF can be shown 
[Beimel et al., 2000]. Our description of an efficient implementation of the algorithm 
of Angluin [1982] using decision trees is adapted from Kearns and Vazirani [1994]. 

The model of identification in the limit of automata was introduced and analyzed 
by Gold [1967]. Deterministic finite automata were shown not to be identifiable in 
the limit from positive examples [Gold, 1967]. However, positive results were given for 
the identification in the limit of a number of sub-classes, such as the family of 
k-reversible languages [Angluin, 1982] considered in this chapter. Positive results also 
hold for learning subsequential transducers [Oncina et al., 1993]. Some restricted 
classes of probabilistic automata such as acyclic probabilistic automata were also 
shown by Ron et al. [1995] to be efficiently learnable. 

There is a vast literature dealing with the problem of learning automata. In 
particular, positive results have been shown for a variety of sub-families of finite 
automata in the scenario of learning with queries and learning scenarios of different 
kinds have been introduced and analyzed for this problem. The results presented in 
this chapter should therefore be viewed only as an introduction to that material. 


13.6 Exercises 


13.1 Minimal DFA. Show that a minimal DFA A also has the minimal number 
of transitions among all other DFAs equivalent to A. Prove that a language L is 
regular iff $Q = \{\mathrm{Suff}_L(u) : u \in \Sigma^*\}$ is finite. Show that the number of states of a 
minimal DFA A with L(A) = L is precisely the cardinality of Q. 


13.2 VC-dimension of finite automata. 


(a) What is the VC-dimension of the family of all finite automata? What does 
that imply for PAC-learning of finite automata? Does this result change if we 
restrict ourselves to learning acyclic automata (automata with no cycles)? 


(b) Show that the VC-dimension of the family of DFAs with at most $n$ states 
is bounded by $O(|\Sigma|\, n \log n)$. 


13.3 PAC learning with membership queries. Give an example of a concept class C 
that is efficiently PAC-learnable with membership queries but that is not efficiently 
exactly learnable. 


13.4 Learning monotone DNF formulae with queries. Show that the class of mono- 
tone DNF formulae over n variables is efficiently exactly learnable using membership 
and equivalence queries. (Hint: a prime implicant $t$ of a formula $f$ is a product of 
literals such that $t$ implies $f$ but no proper sub-term of $t$ implies $f$. Use the fact 
that for monotone DNF, the number of prime implicants is at most the number 
of terms of the formula.) 


13.5 Learning with unreliable query responses. Consider the problem where the 
learner must find an integer $x$ selected by the oracle within $[1, n]$, where $n > 1$ is 
given. To do so, the learner can ask questions of the form ($x < m$?) or ($x > m$?) 
for $m \in [1, n]$. The oracle responds to these questions but may give an incorrect 
response to k questions. How many questions should the learner ask to determine 
x? (Hint: observe that the learner can repeat each question 2k + 1 times and use 
the majority vote.) 


13.6 Algorithm for learning reversible languages. What is the DFA A returned 
by the algorithm for learning reversible languages when applied to the sample 
S = {ab, aaabb, aabbb, aabbbb}? Suppose we add a new string to the sample, say 
x = abab. How should A be updated to compute the result of the algorithm for 
$S \cup \{x\}$? More generally, describe a method for updating the result of the algorithm 
incrementally. 


13.7 k-reversible languages. A finite automaton $A'$ is said to be k-deterministic if 
it is deterministic modulo a lookahead $k$: if two distinct states $p$ and $q$ are both 
initial, or are both reached from another state $r$ by reading $a \in \Sigma$, then no string 
$u$ of length $k$ can be read in $A'$ both from $p$ and $q$. A finite automaton $A$ is said to 
be k-reversible if it is deterministic and if $A^R$ is k-deterministic. A language $L$ is 
k-reversible if it is accepted by some k-reversible automaton. 


(a) Prove that $L$ is k-reversible iff for any strings $u, u', v \in \Sigma^*$ with $|v| = k$, 

$$\mathrm{Suff}_L(uv) \cap \mathrm{Suff}_L(u'v) \neq \emptyset \implies \mathrm{Suff}_L(uv) = \mathrm{Suff}_L(u'v).$$

(b) Show that a k-reversible language admits a characteristic language. 

(c) Show that the following defines an algorithm for learning k-reversible 
automata. Proceed as in the algorithm for learning reversible automata but 
with the following merging rule instead: merge blocks $B_1$ and $B_2$ if they can 
be reached by the same string $u$ of length $k$ from some other block and if $B_1$ 
and $B_2$ are both final or have a common successor. 


14 Reinforcement Learning 


This chapter presents an introduction to reinforcement learning, a rich area of 
machine learning with connections to control theory, optimization, and cognitive 
sciences. Reinforcement learning is the study of planning and learning in a scenario 
where a learner actively interacts with the environment to achieve a certain goal. 
This active interaction justifies the terminology of agent used to refer to the learner. 
The achievement of the agent’s goal is typically measured by the reward he receives 
from the environment and which he seeks to maximize. 

We first introduce the general scenario of reinforcement learning and then intro- 
duce the model of Markov decision processes (MDPs), which is widely adopted in 
this area, as well as essential concepts such as that of policy or policy value related 
to this model. The rest of the chapter presents several algorithms for the planning 
problem, which corresponds to the case where the environment model is known to 
the agent, and then a series of learning algorithms for the more general case of an 
unknown model. 


14.1 Learning scenario 


The general scenario of reinforcement learning is illustrated by figure 14.1. Unlike 
the supervised learning scenario considered in previous chapters, here, the learner 
does not passively receive a labeled data set. Instead, he collects information 
through a course of actions by interacting with the environment. In response to 
an action, the learner, or agent, receives two types of information: his current state 
in the environment, and a real-valued reward, which is specific to the task and its 
corresponding goal. 

There are several differences between the learning scenario of reinforcement 
learning and that of supervised learning examined in most of the previous chapters. 
Unlike the supervised learning scenario, in reinforcement learning there is no fixed 
distribution according to which instances are drawn; the choice of a policy defines 
the distribution. In fact, slight changes to the policy may have dramatic effects on 
the rewards received. Furthermore, in general, the environment may not be fixed 
and could vary as a result of the actions selected by the agent. This may be a more 
realistic model for some learning problems than the standard supervised learning. 

Figure 14.1 Representation of the general scenario of reinforcement learning. 

The objective of the agent is to maximize his reward and thus to determine 
the best course of actions, or policy, to achieve that objective. However, the 
information he receives from the environment is only the immediate reward related 
to the action just taken. No future or long-term reward feedback is provided by 
the environment. An important aspect of reinforcement learning is to take into 
consideration delayed rewards or penalties. The agent is faced with the dilemma 
between exploring unknown states and actions to gain more information about the 
environment and the rewards, and exploiting the information already collected to 
optimize his reward. This is known as the exploration versus exploitation trade- 
off inherent in reinforcement learning. Note that within this scenario, training and 
testing phases are intermixed. 

Two main settings can be distinguished here: the case where the environment 
model is known to the agent, in which case his objective of maximizing the reward 
received is reduced to a planning problem, and the case where the environment 
model is unknown, in which case he faces a learning problem. In the latter case, 
the agent must learn from the state and reward information gathered to both 
gain information about the environment and determine the best action policy. This 
chapter presents algorithmic solutions for both of these settings. 


14.2 Markov decision process model 


We first introduce the model of Markov decision processes (MDPs), a model of the 
environment and interactions with the environment widely adopted in reinforcement 
learning. An MDP is a Markovian process defined as follows. 


Definition 14.1 MDPs 
A Markov decision process (MDP) is defined by: 

- a set of states $S$, possibly infinite. 

- a start state or initial state $s_0 \in S$. 

- a set of actions $A$, possibly infinite. 

- a transition probability $\Pr[s' \mid s, a]$: distribution over destination states $s' = \delta(s, a)$. 

- a reward probability $\Pr[r' \mid s, a]$: distribution over rewards returned $r' = r(s, a)$. 


The model is Markovian because the transition and reward probabilities depend 
only on the current state s and not the entire history of states and actions taken. 
This definition of MDP can be further generalized to the case of non-discrete state 
and action sets. 

In a discrete-time model, actions are taken at a set of decision epochs {0,...,T}, 
and this is the model we will adopt in what follows. This model can also be 
straightforwardly generalized to a continuous-time one where actions are taken at 
arbitrary points in time. 

When T is finite, the MDP is said to have a finite horizon. Independently of the 
finiteness of the time horizon, an MDP is said to be finite when both S and A are 
finite sets. Here, we are considering the general case where the reward $r(s, a)$ at 
state $s$ when taking action $a$ is a random variable. However, in many cases, the 
reward is assumed to be a deterministic function of the state-action pair $(s, a)$. 

Figure 14.2 illustrates the model corresponding to an MDP. At time $t \in [0, T]$ 
the state observed by the agent is $s_t$ and he takes action $a_t \in A$. The state reached 
is $s_{t+1}$ (with probability $\Pr[s_{t+1} \mid a_t, s_t]$) and the reward received $r_{t+1} \in \mathbb{R}$ (with 
probability $\Pr[r_{t+1} \mid a_t, s_t]$). 
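
The interaction just described can be sketched in a few lines. The sketch below assumes a finite MDP stored as a table from (state, action) pairs to lists of outcomes, with a reward attached to each transition (the deterministic-reward case). The state names, action names, probabilities and rewards are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical finite MDP: (state, action) -> list of (next_state, probability, reward).
TRANSITIONS = {
    ("s0", "stay"): [("s0", 0.75, 0.0), ("s1", 0.25, 1.0)],
    ("s0", "move"): [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 1.0, 0.5)],
    ("s1", "move"): [("s0", 0.5, 2.0), ("s1", 0.5, 0.0)],
}


def step(state, action, rng):
    """Sample (s_{t+1}, r_{t+1}) given (s_t, a_t): depends only on the current pair."""
    outcomes = TRANSITIONS[(state, action)]
    weights = [prob for _, prob, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def rollout(policy, start, horizon, rng):
    """Follow a stationary policy (a dict state -> action) for `horizon` epochs."""
    trajectory = []
    state = start
    for _ in range(horizon):
        action = policy[state]
        next_state, reward = step(state, action, rng)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


rng = random.Random(0)
trajectory = rollout({"s0": "stay", "s1": "move"}, "s0", 10, rng)
```

The Markov property shows up in the signature of `step`: it receives only the current state and action, never the history.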

Many real-world tasks can be represented by MDPs. Figure 14.3 gives the example 
of a simple MDP for a robot picking up balls on a tennis court. 


14.3 Policy 


The main problem for an agent in an MDP environment is to determine the action 
to take at each state, that is, an action policy. 


14.3.1 Definition 


Definition 14.2 Policy 
A policy is a mapping $\pi\colon S \to A$. 


More precisely, this is the definition of a stationary policy since the choice of the 
action does not depend on the time. More generally, we could define a non-stationary 
policy as a sequence of mappings $\pi_t\colon S \to A$ indexed by $t$. In particular, in the finite 
horizon case, typically a non-stationary policy is necessary. 

Figure 14.2 Illustration of the states and transitions of an MDP at different times. 

The agent’s objective is to find a policy that maximizes his expected (reward) 
return. The return he receives following a policy $\pi$ along a specific sequence of states 
$s_t, \ldots, s_T$ is defined as follows: 

- finite horizon ($T < \infty$): $\sum_{\tau=0}^{T-t} r(s_{t+\tau}, \pi(s_{t+\tau}))$. 

- infinite horizon ($T = \infty$): $\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+\tau}, \pi(s_{t+\tau}))$, where $\gamma \in [0, 1)$ is a constant 
factor less than one used to discount future rewards. 


Note that the return is a single scalar summarizing a possibly infinite sequence 
of immediate rewards. In the discounted case, early rewards are viewed as more 
valuable than later ones. 
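
As a small illustration of how a reward sequence collapses into a single scalar, the following sketch computes the return of a finite list of observed rewards. The function name `discounted_return` is an assumption made for this example; setting `gamma` to 1 gives the undiscounted finite-horizon sum.

```python
def discounted_return(rewards, gamma):
    """Return sum over tau of gamma**tau * rewards[tau], accumulated backward."""
    total = 0.0
    for reward in reversed(rewards):
        # Horner-style step: the return from this epoch is r + gamma * (later return).
        total = reward + gamma * total
    return total


discounted = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25
undiscounted = discounted_return([2.0, 0.0, 4.0], 1.0)  # plain sum
```

The backward accumulation mirrors the recursive structure that the Bellman equations of the next sections exploit.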

This leads to the following definition of the value of a policy at each state. 


14.3.2 Policy value 


Definition 14.3 Policy value 
The value $V_\pi(s)$ of a policy $\pi$ at state $s \in S$ is defined as the expected reward 
returned when starting at $s$ and following policy $\pi$: 

- finite horizon: $V_\pi(s) = \mathrm{E}\big[\sum_{\tau=0}^{T-t} r(s_{t+\tau}, \pi(s_{t+\tau})) \mid s_t = s\big]$; 

- infinite discounted horizon: $V_\pi(s) = \mathrm{E}\big[\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+\tau}, \pi(s_{t+\tau})) \mid s_t = s\big]$; 

where the expectations are over the random selection of the states $s_t$ and the reward 
values $r_{t+1}$. An infinite undiscounted horizon is also often considered based on the 
limit of the average reward, when it exists. 


As we shall see later, there exists a policy that is optimal for any start state. In view 
of the definition of the policy values, seeking the optimal policy can be equivalently 
formulated as determining a policy with maximum value at all states. 


14.3.3 Policy evaluation 


The value of a policy at state $s$ can be expressed in terms of its values at other 
states, forming a system of linear equations. 

Figure 14.3 Example of a simple MDP for a robot picking up balls on a tennis 
court. The set of actions is $A = \{\text{search}, \text{carry}, \text{pickup}\}$ and the set of states is reduced 
to $S = \{\text{start}, \text{other}\}$. Each transition is labeled with the action followed by the 
transition probability and the reward received after taking that 
action. $R_1$, $R_2$, and $R_3$ are real numbers indicating the reward associated to each 
transition (case of deterministic reward). 


Proposition 14.1 Bellman equation 
The values $V_\pi(s)$ of policy $\pi$ at states $s \in S$ for an infinite horizon MDP obey the 
following system of linear equations: 

$$\forall s \in S, \quad V_\pi(s) = \mathrm{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s'). \tag{14.1}$$


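Proof We can decompose the expression of the policy value as a sum of the first 
term and the rest of the terms: 

$$\begin{aligned}
V_\pi(s) &= \mathrm{E}\Big[\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+\tau}, \pi(s_{t+\tau})) \,\Big|\, s_t = s\Big] \\
&= \mathrm{E}[r(s, \pi(s))] + \gamma\, \mathrm{E}\Big[\sum_{\tau=0}^{+\infty} \gamma^{\tau} r(s_{t+1+\tau}, \pi(s_{t+1+\tau})) \,\Big|\, s_t = s\Big] \\
&= \mathrm{E}[r(s, \pi(s))] + \gamma\, \mathrm{E}[V_\pi(\delta(s, \pi(s)))],
\end{aligned}$$

since we can recognize the expression of $V_\pi(\delta(s, \pi(s)))$ in the expectation of the 
second line. ∎ 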


The Bellman equations can be rewritten as 

$$\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}, \tag{14.2}$$

using the following notation: $\mathbf{P}$ denotes the transition probability matrix defined 
by $\mathbf{P}_{s,s'} = \Pr[s' \mid s, \pi(s)]$ for all $s, s' \in S$; $\mathbf{V}$ is the value column matrix whose $s$th 
component is $\mathbf{V}_s = V_\pi(s)$; and $\mathbf{R}$ the reward column matrix whose $s$th component 
is $\mathbf{R}_s = \mathrm{E}[r(s, \pi(s))]$. $\mathbf{V}$ is typically the unknown variable in the Bellman equations 
and is determined by solving for it. The following theorem shows that for a finite 
MDP this system of linear equations admits a unique solution. 


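Theorem 14.1 
For a finite MDP, Bellman’s equation admits a unique solution given by 

$$\mathbf{V}_0 = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}. \tag{14.3}$$

Proof The Bellman equation (14.2) can be equivalently written as 

$$(\mathbf{I} - \gamma \mathbf{P}) \mathbf{V} = \mathbf{R}.$$

Thus, to prove the theorem it suffices to show that $(\mathbf{I} - \gamma \mathbf{P})$ is invertible. To do so, 
note that the infinity norm of $\mathbf{P}$ can be computed using its stochasticity properties: 

$$\|\mathbf{P}\|_\infty = \max_{s} \sum_{s'} |\mathbf{P}_{s,s'}| = \max_{s} \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1.$$

This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. The eigenvalues of $\gamma \mathbf{P}$ thus all have modulus less than one, 
and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible. ∎ 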


Thus, for a finite MDP, when the transition probability matrix $\mathbf{P}$ and the reward 
expectations $\mathbf{R}$ are known, the value of policy $\pi$ at all states can be determined by 
inverting a matrix. 
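
Policy evaluation by solving the linear system $(\mathbf{I} - \gamma\mathbf{P})\mathbf{V} = \mathbf{R}$ can be sketched as follows. The sketch uses exact rational arithmetic and a small Gauss-Jordan routine so that it needs no external library. The two-state matrix `P`, the rewards `R` and the discount factor are hypothetical values chosen for illustration, and the helper names are assumptions of this example.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def evaluate_policy(P, R, gamma):
    """Return the vector V satisfying (I - gamma * P) V = R."""
    n = len(P)
    A = [[(1 if i == j else 0) - gamma * P[i][j] for j in range(n)] for i in range(n)]
    return solve_linear(A, R)


# Hypothetical fixed policy on two states: state 0 moves to state 1 (reward 2),
# state 1 moves to state 0 (reward 3).
P = [[Fraction(0), Fraction(1)], [Fraction(1), Fraction(0)]]
R = [Fraction(2), Fraction(3)]
gamma = Fraction(1, 2)
V = evaluate_policy(P, R, gamma)
```

For larger state spaces one would typically call a numerical linear-algebra routine instead of hand-written elimination; the structure of the computation is the same.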


14.3.4 Optimal policy 


The objective of the agent can be reformulated as that of seeking the optimal policy 
defined as follows. 


Definition 14.4 Optimal policy 
A policy $\pi^*$ is optimal if it has maximal value for all states $s \in S$. 


Thus, by definition, for any $s \in S$, $V_{\pi^*}(s) = \max_{\pi} V_\pi(s)$. We will use the shorter 
notation $V^*$ instead of $V_{\pi^*}$. $V^*(s)$ is the maximal cumulative reward the agent can 
expect to receive when starting at state $s$. 


Definition 14.5 State-action value function 

The optimal state-action value function $Q^*$ is defined for all $(s, a) \in S \times A$ as the 
expected return for taking action $a \in A$ at state $s \in S$ and then following the optimal 
policy: 

$$Q^*(s, a) = \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s'). \tag{14.4}$$

It is not hard to see then that the optimal policy values are related to $Q^*$ via 

$$\forall s \in S, \quad V^*(s) = \max_{a \in A} Q^*(s, a). \tag{14.5}$$

Indeed, by definition, $V^*(s) \le \max_{a \in A} Q^*(s, a)$ for all $s \in S$. If for some $s$ we had 
$V^*(s) < \max_{a \in A} Q^*(s, a)$, then the maximizing action would define a better policy. 
Observe also that, by definition of the optimal policy, we have 

$$\forall s \in S, \quad \pi^*(s) = \operatorname*{argmax}_{a \in A} Q^*(s, a). \tag{14.6}$$

Thus, the knowledge of the state-action value function $Q^*$ is sufficient for the agent 
to determine the optimal policy, without any direct knowledge of the reward or 
transition probabilities. Replacing $Q^*$ by its definition in (14.5) gives the following 
system of equations for the optimal policy values $V^*(s)$: 

$$V^*(s) = \max_{a \in A} \Big\{ \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \Big\}, \tag{14.7}$$

also known as Bellman equations. Note that this new system of equations is not 
linear due to the presence of the max operator. It is distinct from the previous linear 
system we defined under the same name in (14.1) and (14.2). 
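
The relations (14.4) to (14.6) can be checked numerically on a small example. The sketch below assumes an illustrative two-state model stored as `MODEL[s][a] = (expected reward, {next state: probability})` with $\gamma = 1/2$, and takes as given a vector `V_star` that satisfies (14.7) for this model. All names and numbers are assumptions made for the illustration.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)

# Illustrative model: MODEL[s][a] = (expected reward, {next_state: probability}).
MODEL = {
    1: {"a": (F(2), {1: F(3, 4), 2: F(1, 4)}), "b": (F(2), {2: F(1)})},
    2: {"c": (F(2), {2: F(1)}), "d": (F(3), {1: F(1)})},
}


def q_from_v(V, gamma):
    """Compute Q(s, a) = E[r(s, a)] + gamma * sum_s' Pr[s'|s, a] V(s'), as in (14.4)."""
    return {
        (s, a): reward + gamma * sum(p * V[t] for t, p in probs.items())
        for s, actions in MODEL.items()
        for a, (reward, probs) in actions.items()
    }


def greedy_policy(Q):
    """Pick, at each state, an action maximizing Q(s, a), as in (14.6)."""
    policy = {}
    for (s, a), q in Q.items():
        if s not in policy or q > Q[(s, policy[s])]:
            policy[s] = a
    return policy


V_star = {1: F(14, 3), 2: F(16, 3)}  # assumed to solve (14.7) for this model
Q_star = q_from_v(V_star, GAMMA)
pi_star = greedy_policy(Q_star)
```

Here the greedy policy is read off from `Q_star` alone, which is the point made above: once $Q^*$ is known, the transition and reward probabilities are no longer needed to act optimally.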


14.4 Planning algorithms 


In this section, we assume that the environment model is known. That is, the 
transition probability $\Pr[s' \mid s, a]$ and the expected reward $\mathrm{E}[r(s, a)]$ for all $s, s' \in S$ 
and $a \in A$ are assumed to be given. The problem of finding the optimal policy then 
does not require learning the parameters of the environment model or estimating 
other quantities helpful in determining the best course of actions; it is purely a 
planning problem. 

This section discusses three algorithms for this planning problem: the value 
iteration algorithm, the policy iteration algorithm, and a linear programming 
formulation of the problem. 


14.4.1 Value iteration 


The value iteration algorithm seeks to determine the optimal policy values $V^*(s)$ 
at each state $s \in S$, and thereby the optimal policy. The algorithm is based on 
the Bellman equations (14.7). As already indicated, these equations do not form 
a system of linear equations and require a different technique to determine the 
solution. The main idea behind the design of the algorithm is to use an iterative 
method to solve them: the new values of $V(s)$ are determined using the Bellman 
equations and the current values. This process is repeated until a convergence 
condition is met. 

VALUEITERATION($\mathbf{V}_0$) 
    $\mathbf{V} \leftarrow \mathbf{V}_0$    ▷ $\mathbf{V}_0$ arbitrary value 
    while $\|\mathbf{V} - \Phi(\mathbf{V})\| \ge \frac{(1-\gamma)\epsilon}{\gamma}$ do 
        $\mathbf{V} \leftarrow \Phi(\mathbf{V})$ 
    return $\Phi(\mathbf{V})$ 

Figure 14.4 Value iteration algorithm. 

For a vector $\mathbf{V}$ in $\mathbb{R}^{|S|}$, we denote by $V(s)$ its $s$th coordinate, for any $s \in S$. Let 
$\Phi\colon \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ be the mapping defined based on Bellman’s equations (14.7): 

$$\forall s \in S, \quad [\Phi(\mathbf{V})](s) = \max_{a \in A} \Big\{ \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s') \Big\}. \tag{14.8}$$

The maximizing actions $a \in A$ in these equations define an action to take at each 
state $s \in S$, that is a policy $\pi$. We can thus rewrite these equations in matrix terms 
as follows: 

$$\Phi(\mathbf{V}) = \max_{\pi} \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V}\}, \tag{14.9}$$

where $\mathbf{P}_\pi$ is the transition probability matrix defined by $(\mathbf{P}_\pi)_{s,s'} = \Pr[s' \mid s, \pi(s)]$ 
for all $s, s' \in S$, and $\mathbf{R}_\pi$ the reward vector defined by $(\mathbf{R}_\pi)_s = \mathrm{E}[r(s, \pi(s))]$, for all 
$s \in S$. 

The algorithm is directly based on (14.9). The pseudocode is given above. Starting 
from an arbitrary policy value vector $\mathbf{V}_0 \in \mathbb{R}^{|S|}$, the algorithm iteratively applies 
$\Phi$ to the current $\mathbf{V}$ to obtain a new policy value vector until $\|\mathbf{V} - \Phi(\mathbf{V})\| < \frac{(1-\gamma)\epsilon}{\gamma}$, 
where $\epsilon > 0$ is a desired approximation. The following theorem proves the 
convergence of the algorithm to the optimal policy values. 


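Theorem 14.2 
For any initial value $\mathbf{V}_0$, the sequence defined by $\mathbf{V}_{n+1} = \Phi(\mathbf{V}_n)$ converges to $\mathbf{V}^*$. 


Proof We first show that $\Phi$ is $\gamma$-Lipschitz for the $\|\cdot\|_\infty$ norm.¹ For any $s \in S$ and 
$\mathbf{V} \in \mathbb{R}^{|S|}$, let $a^*(s)$ be the maximizing action defining $\Phi(\mathbf{V})(s)$ in (14.8). Then, for 
any $s \in S$ and any $\mathbf{U} \in \mathbb{R}^{|S|}$, 

$$\begin{aligned}
\Phi(\mathbf{V})(s) - \Phi(\mathbf{U})(s) &\le \Phi(\mathbf{V})(s) - \Big( \mathrm{E}[r(s, a^*(s))] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, U(s') \Big) \\
&= \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, [V(s') - U(s')] \\
&\le \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, \|\mathbf{V} - \mathbf{U}\|_\infty = \gamma \|\mathbf{V} - \mathbf{U}\|_\infty.
\end{aligned}$$

Proceeding similarly with $\Phi(\mathbf{U})(s) - \Phi(\mathbf{V})(s)$, we obtain $\Phi(\mathbf{U})(s) - \Phi(\mathbf{V})(s) \le 
\gamma \|\mathbf{V} - \mathbf{U}\|_\infty$. Thus, $|\Phi(\mathbf{V})(s) - \Phi(\mathbf{U})(s)| \le \gamma \|\mathbf{V} - \mathbf{U}\|_\infty$ for all $s$, which implies 

$$\|\Phi(\mathbf{V}) - \Phi(\mathbf{U})\|_\infty \le \gamma \|\mathbf{V} - \mathbf{U}\|_\infty,$$

that is the $\gamma$-Lipschitz property of $\Phi$. Now, by Bellman equations (14.7), $\mathbf{V}^* = 
\Phi(\mathbf{V}^*)$, thus for any $n \in \mathbb{N}$, 

$$\|\mathbf{V}^* - \mathbf{V}_{n+1}\|_\infty = \|\Phi(\mathbf{V}^*) - \Phi(\mathbf{V}_n)\|_\infty \le \gamma \|\mathbf{V}^* - \mathbf{V}_n\|_\infty \le \gamma^{n+1} \|\mathbf{V}^* - \mathbf{V}_0\|_\infty,$$

which proves the convergence of the sequence to $\mathbf{V}^*$ since $\gamma \in (0, 1)$. ∎ 

1. A $\beta$-Lipschitz function with $\beta < 1$ is also called $\beta$-contracting. In a complete metric 
space, that is a metric space where any Cauchy sequence converges to a point of that 
space, a $\beta$-contracting function $f$ admits a fixed point: any sequence $(f(x_n))_{n \in \mathbb{N}}$ converges 
to some $x$ with $f(x) = x$. $\mathbb{R}^N$, $N \ge 1$, or, more generally, any finite-dimensional vector 
space, is a complete metric space. 


The $\epsilon$-optimality of the value returned by the algorithm can be shown as follows. 
By the triangle inequality and the $\gamma$-Lipschitz property of $\Phi$, for any $n \in \mathbb{N}$, 

$$\begin{aligned}
\|\mathbf{V}^* - \mathbf{V}_{n+1}\|_\infty &\le \|\mathbf{V}^* - \Phi(\mathbf{V}_{n+1})\|_\infty + \|\Phi(\mathbf{V}_{n+1}) - \mathbf{V}_{n+1}\|_\infty \\
&= \|\Phi(\mathbf{V}^*) - \Phi(\mathbf{V}_{n+1})\|_\infty + \|\Phi(\mathbf{V}_{n+1}) - \Phi(\mathbf{V}_n)\|_\infty \\
&\le \gamma \|\mathbf{V}^* - \mathbf{V}_{n+1}\|_\infty + \gamma \|\mathbf{V}_{n+1} - \mathbf{V}_n\|_\infty.
\end{aligned}$$

Thus, if $\mathbf{V}_{n+1}$ is the policy value returned by the algorithm, we have 

$$\|\mathbf{V}^* - \mathbf{V}_{n+1}\|_\infty \le \frac{\gamma}{1-\gamma} \|\mathbf{V}_{n+1} - \mathbf{V}_n\|_\infty \le \epsilon.$$

The convergence of the algorithm is in $O(\log \frac{1}{\epsilon})$ number of iterations. Indeed, observe 
that 

$$\|\mathbf{V}_{n+1} - \mathbf{V}_n\|_\infty = \|\Phi(\mathbf{V}_n) - \Phi(\mathbf{V}_{n-1})\|_\infty \le \gamma \|\mathbf{V}_n - \mathbf{V}_{n-1}\|_\infty \le \gamma^n \|\Phi(\mathbf{V}_0) - \mathbf{V}_0\|_\infty.$$

Thus, if $n$ is the largest integer such that $\frac{(1-\gamma)\epsilon}{\gamma} \le \|\mathbf{V}_{n+1} - \mathbf{V}_n\|_\infty$, it must verify 
$\frac{(1-\gamma)\epsilon}{\gamma} \le \gamma^n \|\Phi(\mathbf{V}_0) - \mathbf{V}_0\|_\infty$ and therefore $n \le O(\log \frac{1}{\epsilon})$.² 

Figure 14.5 Example of MDP with two states. The state set is reduced to 
$S = \{1, 2\}$ and the action set to $A = \{a, b, c, d\}$. Only transitions with non-zero 
probabilities are represented. Each transition is labeled with the action taken 
followed by a pair $[p, r]$ after a slash separator, where $p$ is the probability of the 
transition and $r$ the expected reward for taking that transition. 
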
Figure 14.5 shows a simple example of MDP with two states. The iterated values 
of these states calculated by the algorithm for that MDP are given by 

$$V_{n+1}(1) = \max\Big\{ 2 + \gamma \Big( \tfrac{3}{4} V_n(1) + \tfrac{1}{4} V_n(2) \Big),\; 2 + \gamma V_n(2) \Big\}$$

$$V_{n+1}(2) = \max\big\{ 3 + \gamma V_n(1),\; 2 + \gamma V_n(2) \big\}.$$

For $V_0(1) = -1$, $V_0(2) = 1$, and $\gamma = 1/2$, we obtain $V_1(1) = V_1(2) = 5/2$. 
Thus, both states seem to have the same policy value initially. However, by the fifth 
iteration, $V_5(1) = 4.53125$, $V_5(2) = 5.15625$ and the algorithm quickly converges 
to the optimal values $V^*(1) = 14/3$ and $V^*(2) = 16/3$ showing that state 2 has a 
higher optimal value. 
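
The iteration can be reproduced with a short script. This is a sketch under the assumption that the two-state MDP is encoded directly from the update equations above (the labels attached to the two actions available at state 2 are an assumption). Exact fractions are used so that the iterates can be compared with the values quoted in the text, and the function names are illustrative.

```python
from fractions import Fraction as F

# MODEL[s][a] = (expected reward, {next_state: probability}), read off the
# update equations for V_{n+1}(1) and V_{n+1}(2); action labels at state 2 assumed.
MODEL = {
    1: {"a": (F(2), {1: F(3, 4), 2: F(1, 4)}), "b": (F(2), {2: F(1)})},
    2: {"c": (F(2), {2: F(1)}), "d": (F(3), {1: F(1)})},
}


def bellman_update(V, gamma):
    """Apply the operator Phi of (14.8) once to the value table V."""
    return {
        s: max(
            reward + gamma * sum(p * V[t] for t, p in probs.items())
            for reward, probs in MODEL[s].values()
        )
        for s in MODEL
    }


def value_iteration(V0, gamma, eps):
    """Iterate Phi until ||V - Phi(V)||_inf < (1 - gamma) * eps / gamma, then return Phi(V)."""
    threshold = (1 - gamma) * eps / gamma
    V = dict(V0)
    while True:
        V_next = bellman_update(V, gamma)
        if max(abs(V_next[s] - V[s]) for s in V) < threshold:
            return V_next
        V = V_next


GAMMA = F(1, 2)
V0 = {1: F(-1), 2: F(1)}
V1 = bellman_update(V0, GAMMA)      # both coordinates equal 5/2
V5 = V0
for _ in range(5):
    V5 = bellman_update(V5, GAMMA)  # 145/32 = 4.53125 and 165/32 = 5.15625
V_eps = value_iteration(V0, GAMMA, F(1, 1000))
```

The stopping rule is the one of figure 14.4, so the returned table is within `eps` of the optimal values in the infinity norm.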


14.4.2 Policy iteration 


An alternative algorithm for determining the best policy consists of using policy 
evaluations, which can be achieved via a matrix inversion, as shown by theorem 14.1. 
The pseudocode of the algorithm known as policy iteration algorithm is given in 
figure 14.6. Starting with an arbitrary action policy $\pi_0$, the algorithm repeatedly 
computes the value of the current policy $\pi$ via that matrix inversion and greedily 
selects the new policy as the one maximizing the right-hand side of the Bellman 
equations (14.9). 

POLICYITERATION($\pi_0$) 
    $\pi \leftarrow \pi_0$    ▷ $\pi_0$ arbitrary policy 
    $\pi' \leftarrow$ NIL 
    while ($\pi \neq \pi'$) do 
        $\mathbf{V} \leftarrow \mathbf{V}_\pi$    ▷ policy evaluation: solve $(\mathbf{I} - \gamma \mathbf{P}_\pi)\mathbf{V} = \mathbf{R}_\pi$. 
        $\pi' \leftarrow \pi$ 
        $\pi \leftarrow \operatorname{argmax}_{\pi} \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V}\}$    ▷ greedy policy improvement. 
    return $\pi$ 

Figure 14.6 Policy iteration algorithm. 

2. Here, the O-notation hides the dependency on the discount factor $\gamma$. As a function of 
$\gamma$, the running time is not polynomial. 

The following theorem proves the convergence of the policy iteration algorithm. 


Theorem 14.3 
Let $(\mathbf{V}_n)_{n \in \mathbb{N}}$ be the sequence of policy values computed by the algorithm, then, for 
any $n \in \mathbb{N}$, the following inequalities hold: 

$$\mathbf{V}_n \le \mathbf{V}_{n+1} \le \mathbf{V}^*. \tag{14.10}$$


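Proof Let $\pi_{n+1}$ be the policy improvement at the $n$th iteration of the algorithm. 
We first show that $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1}$ preserves ordering, that is, for any column 
matrices $\mathbf{X}$ and $\mathbf{Y}$ in $\mathbb{R}^{|S|}$, if $(\mathbf{Y} - \mathbf{X}) \ge 0$, then $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1}(\mathbf{Y} - \mathbf{X}) \ge 0$. 
As shown in the proof of theorem 14.1, $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. Since the radius of 
convergence of the power series $(1 - x)^{-1}$ is one, we can use its expansion and write 

$$(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} = \sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{n+1}})^k.$$

Thus, if $\mathbf{Z} = (\mathbf{Y} - \mathbf{X}) \ge 0$, then $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{Z} = \sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{n+1}})^k \mathbf{Z} \ge 0$, since 
the entries of matrix $\mathbf{P}_{\pi_{n+1}}$ and its powers are all non-negative as well as those of 
$\mathbf{Z}$. 
Now, by definition of $\pi_{n+1}$, we have 

$$\mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} \mathbf{V}_n \ge \mathbf{R}_{\pi_n} + \gamma \mathbf{P}_{\pi_n} \mathbf{V}_n = \mathbf{V}_n,$$

which shows that $\mathbf{R}_{\pi_{n+1}} \ge (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}}) \mathbf{V}_n$. Since $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1}$ preserves ordering, 
this implies that $\mathbf{V}_{n+1} = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{R}_{\pi_{n+1}} \ge \mathbf{V}_n$, which concludes the proof 
of the theorem. ∎ 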


Note that two consecutive policy values can be equal only at the last iteration of 
the algorithm. The total number of possible policies is $|A|^{|S|}$, thus this constitutes 
a straightforward upper bound on the maximal number of iterations. Better upper 
bounds of the form $O\big(\frac{|A|^{|S|}}{|S|}\big)$ are known for this algorithm. 

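For the simple MDP shown by figure 14.5, let the initial policy $\pi_0$ be defined by 
$\pi_0(1) = b$, $\pi_0(2) = c$. Then, the system of linear equations for evaluating this policy 
is 

$$\begin{cases} V_{\pi_0}(1) = 1 + \gamma V_{\pi_0}(2) \\ V_{\pi_0}(2) = 2 + \gamma V_{\pi_0}(2), \end{cases}$$

which gives $V_{\pi_0}(1) = \frac{1+\gamma}{1-\gamma}$ and $V_{\pi_0}(2) = \frac{2}{1-\gamma}$. 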


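Theorem 14.4 
Let $(\mathbf{U}_n)_{n \in \mathbb{N}}$ be the sequence of policy values generated by the value iteration 
algorithm, and $(\mathbf{V}_n)_{n \in \mathbb{N}}$ the one generated by the policy iteration algorithm. If 
$\mathbf{U}_0 = \mathbf{V}_0$, then, 

$$\forall n \in \mathbb{N}, \quad \mathbf{U}_n \le \mathbf{V}_n \le \mathbf{V}^*. \tag{14.11}$$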


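Proof We first show that the function $\Phi$ previously introduced is monotonic. Let 
$\mathbf{U}$ and $\mathbf{V}$ be such that $\mathbf{U} \le \mathbf{V}$ and let $\pi$ be the policy such that $\Phi(\mathbf{U}) = \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{U}$. 
Then, 

$$\Phi(\mathbf{U}) \le \mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V} \le \max_{\pi'} \{\mathbf{R}_{\pi'} + \gamma \mathbf{P}_{\pi'} \mathbf{V}\} = \Phi(\mathbf{V}).$$

The proof is by induction on $n$. Assume that $\mathbf{U}_n \le \mathbf{V}_n$, then by the monotonicity 
of $\Phi$, we have 

$$\mathbf{U}_{n+1} = \Phi(\mathbf{U}_n) \le \Phi(\mathbf{V}_n) = \max_{\pi} \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V}_n\}.$$

Let $\pi_{n+1}$ be the maximizing policy, that is, $\pi_{n+1} = \operatorname{argmax}_{\pi} \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi \mathbf{V}_n\}$. Then, 

$$\Phi(\mathbf{V}_n) = \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} \mathbf{V}_n \le \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} \mathbf{V}_{n+1} = \mathbf{V}_{n+1},$$

and thus $\mathbf{U}_{n+1} \le \mathbf{V}_{n+1}$. ∎ 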


The theorem shows that, starting from the same initial values, the policy iteration 
algorithm approaches the optimal policy values in no more iterations than the value 
iteration algorithm. However, each iteration of the policy iteration algorithm requires 
computing a policy value, that is, solving a system of linear equations, which is more 
expensive to compute than an iteration of the value iteration algorithm. 
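
Policy iteration can be sketched along the lines of figure 14.6. The example below assumes a two-state model encoded as `MODEL[s][a] = (expected reward, {next state: probability})` with $\gamma = 1/2$; the numbers are illustrative assumptions, as are all function names. Ties in the improvement step keep the current action, an implementation choice made here so that the loop stops as soon as the policy is stable.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)
MODEL = {
    1: {"a": (F(2), {1: F(3, 4), 2: F(1, 4)}), "b": (F(2), {2: F(1)})},
    2: {"c": (F(2), {2: F(1)}), "d": (F(3), {1: F(1)})},
}
STATES = sorted(MODEL)


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def q_value(V, s, a):
    reward, probs = MODEL[s][a]
    return reward + GAMMA * sum(p * V[t] for t, p in probs.items())


def evaluate(policy):
    """Policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    A = [[(1 if s == t else 0) - GAMMA * MODEL[s][policy[s]][1].get(t, 0)
          for t in STATES] for s in STATES]
    b = [MODEL[s][policy[s]][0] for s in STATES]
    return dict(zip(STATES, solve_linear(A, b)))


def improve(policy, V):
    """Greedy improvement; an action replaces the current one only if strictly better."""
    new_policy = {}
    for s in STATES:
        best = policy[s]
        for a in MODEL[s]:
            if q_value(V, s, a) > q_value(V, s, best):
                best = a
        new_policy[s] = best
    return new_policy


def policy_iteration(policy):
    """Return the final policy and the list of policy values V_0, V_1, ..."""
    values = []
    while True:
        V = evaluate(policy)
        values.append(V)
        new_policy = improve(policy, V)
        if new_policy == policy:
            return policy, values
        policy = new_policy


final_policy, values = policy_iteration({1: "a", 2: "c"})
```

The list `values` is non-decreasing coordinate by coordinate, as stated by theorem 14.3, and applying the value iteration update the same number of times from `values[0]` never overtakes it, in line with theorem 14.4.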


14.4.3 Linear programming 


An alternative formulation of the optimization problem defined by the Bellman 
equations (14.7) is via linear programming (LP), that is an optimization problem 
with a linear objective function and linear constraints. LPs admit (weakly) 
polynomial-time algorithmic solutions. There exist a variety of different methods 
for solving relatively large LPs in practice, using the simplex method, interior-point 
methods, or a variety of special-purpose solutions. All of these methods could be 
applied in this context. 

By definition, the equations (14.7) are each based on a maximization. These 
maximizations are equivalent to seeking to minimize all elements of $\{V(s) : s \in S\}$ 
under the constraints $V(s) \ge \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s')$, ($s \in S$, $a \in A$). Thus, 
this can be written as the following LP for any set of fixed positive weights $\alpha(s) > 0$, 
($s \in S$): 


$$\min_{\mathbf{V}} \sum_{s \in S} \alpha(s) V(s) \tag{14.12}$$

$$\text{subject to} \quad \forall s \in S, \forall a \in A, \quad V(s) \ge \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s'),$$

where $\boldsymbol{\alpha} > 0$ is the vector with the $s$th component equal to $\alpha(s)$.³ To make each 
coefficient $\alpha(s)$ interpretable as a probability, we can further add the constraints that 
$\sum_{s \in S} \alpha(s) = 1$. The number of rows of this LP is $|S||A|$ and its number of columns 
$|S|$. The complexity of the solution techniques for LPs is typically more favorable in 
terms of the number of rows than the number of columns. This motivates a solution 
based on the equivalent dual formulation of this LP which can be written as 


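$$\max_{\mathbf{x}} \sum_{s \in S, a \in A} \mathrm{E}[r(s, a)]\, x(s, a) \tag{14.13}$$

$$\text{subject to} \quad \forall s' \in S, \quad \sum_{a \in A} x(s', a) = \alpha(s') + \gamma \sum_{s \in S, a \in A} \Pr[s' \mid s, a]\, x(s, a)$$

$$\forall s \in S, \forall a \in A, \quad x(s, a) \ge 0,$$

and for which the number of rows is only $|S|$ and the number of columns $|S||A|$. 
Here $x(s, a)$ can be interpreted as the probability of being in state $s$ and taking 
action $a$. 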
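
The primal constraints of (14.12) can be checked directly on a small example. The sketch below is not an LP solver: it only tests feasibility of candidate value tables and compares objective values, assuming an illustrative two-state model with $\gamma = 1/2$ and uniform weights $\alpha$. The model, the candidate tables and the function names are assumptions made for the illustration.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)
MODEL = {
    1: {"a": (F(2), {1: F(3, 4), 2: F(1, 4)}), "b": (F(2), {2: F(1)})},
    2: {"c": (F(2), {2: F(1)}), "d": (F(3), {1: F(1)})},
}


def slack(V, s, a):
    """V(s) minus the right-hand side of the constraint for the pair (s, a)."""
    reward, probs = MODEL[s][a]
    return V[s] - (reward + GAMMA * sum(p * V[t] for t, p in probs.items()))


def is_feasible(V):
    """True when V satisfies all |S||A| constraints of the primal LP."""
    return all(slack(V, s, a) >= 0 for s in MODEL for a in MODEL[s])


def objective(V, alpha):
    return sum(alpha[s] * V[s] for s in V)


ALPHA = {1: F(1, 2), 2: F(1, 2)}
V_star = {1: F(14, 3), 2: F(16, 3)}  # assumed solution of (14.7) for this model
V_loose = {1: F(6), 2: F(6)}         # feasible, but with a larger objective
V_low = {1: F(14, 3) - F(1, 10), 2: F(16, 3)}  # violates a constraint at state 1
```

At `V_star` one constraint per state holds with equality (zero slack), which corresponds to the maximizing action in (14.7). Any feasible table with positive slack everywhere, such as `V_loose`, has a strictly larger objective.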


3. Let us emphasize that the LP is only in terms of the variables $V(s)$, as indicated by 
the subscript of the minimization operator, and not in terms of $V(s)$ and $\alpha(s)$. 


14.5 Learning algorithms 


This section considers the more general scenario where the environment model of 
an MDP, that is the transition and reward probabilities, is unknown. This matches 
many realistic applications of reinforcement learning where, for example, a robot is 
placed in an environment that it needs to explore in order to reach a specific goal. 

How can an agent determine the best policy in this context? Since the environment 
models are not known, he may seek to learn them by estimating transition or reward 
probabilities. To do so, as in the standard case of supervised learning, the agent 
needs some amount of training information. In the context of reinforcement learning 
with MDPs, the training information is the sequence of immediate rewards the agent 
receives based on the actions he has taken. 

There are two main learning approaches that can be adopted. One known as the 
model-free approach consists of learning an action policy directly. Another one, a 
model-based approach, consists of first learning the environment model, and then 
using it to learn a policy. The Q-learning algorithm we present for this problem is 
widely adopted in reinforcement learning and belongs to the family of model-free 
approaches. 

The estimation and algorithmic methods adopted for learning in reinforcement 
learning are closely related to the concepts and techniques in stochastic approxi- 
mation. Thus, we start by introducing several useful results of this field that will 
be needed for the proofs of convergence of the reinforcement learning algorithms 
presented. 


14.5.1 Stochastic approximation 


Stochastic approximation methods are iterative algorithms for solving optimization 
problems whose objective function is defined as the expectation of some random 
variable, or for finding the fixed point of a function $H$ that is accessible only through 
noisy observations. These are precisely the type of optimization problems found in 
reinforcement learning. For example, for the Q-learning algorithm we will describe, 
the optimal state-action value function Q* is the fixed point of some function H 
that is defined as an expectation and thus not directly accessible. 

We start with a basic result whose proof and related algorithm show the flavor 
of more complex ones found in stochastic approximation. The theorem is a 
generalization of a result known as the strong law of large numbers. It shows that 
under some conditions on the coefficients, an iterative sequence of estimates 
$\mu_m$ converges almost surely (a.s.) to the mean of a bounded random variable. 


Theorem 14.5 Mean estimation 
Let $X$ be a random variable taking values in $[0, 1]$ and let $x_0, \ldots, x_m$ be 
i.i.d. values of $X$. Define the sequence $(\mu_m)_{m \in \mathbb{N}}$ by 

$$\mu_{m+1} = (1 - \alpha_m)\mu_m + \alpha_m x_m, \tag{14.14}$$

with $\mu_0 = x_0$, $\alpha_m \in [0, 1]$, $\sum_{m \ge 0} \alpha_m = +\infty$ and 
$\sum_{m \ge 0} \alpha_m^2 < +\infty$. Then, 

$$\mu_m \xrightarrow{\text{a.s.}} \mathrm{E}[X]. \tag{14.15}$$


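Proof We give the proof of the $L_2$ convergence. The a.s. convergence is shown 
later for a more general theorem. By the independence assumption, for $m \ge 0$, 

$$\mathrm{Var}[\mu_{m+1}] = (1 - \alpha_m)^2 \mathrm{Var}[\mu_m] + \alpha_m^2 \mathrm{Var}[x_m] \le (1 - \alpha_m) \mathrm{Var}[\mu_m] + \alpha_m^2. \tag{14.16}$$

Let $\epsilon > 0$ and suppose that there exists $N \in \mathbb{N}$ such that for all 
$m \ge N$, $\mathrm{Var}[\mu_m] \ge \epsilon$. Then, for $m \ge N$, 

$$\mathrm{Var}[\mu_{m+1}] \le \mathrm{Var}[\mu_m] - \alpha_m \mathrm{Var}[\mu_m] + \alpha_m^2 \le \mathrm{Var}[\mu_m] - \alpha_m \epsilon + \alpha_m^2,$$

which implies, by reapplying this inequality, that 

$$\mathrm{Var}[\mu_{m+N}] \le \underbrace{\mathrm{Var}[\mu_N] - \epsilon \sum_{n=N}^{m+N} \alpha_n + \sum_{n=N}^{m+N} \alpha_n^2}_{\to -\infty \text{ when } m \to \infty},$$

contradicting $\mathrm{Var}[\mu_{m+N}] \ge 0$. Thus, this contradicts the existence of 
such an integer $N$. Therefore, for all $N \in \mathbb{N}$, there exists $m_0 \ge N$ 
such that $\mathrm{Var}[\mu_{m_0}] \le \epsilon$. 

Choose $N$ large enough so that for all $m \ge N$, the inequality 
$\alpha_m \le \epsilon$ holds. This is possible since the sequence 
$(\alpha_m^2)_{m \in \mathbb{N}}$ and thus $(\alpha_m)_{m \in \mathbb{N}}$ converges to 
zero in view of $\sum_{m \ge 0} \alpha_m^2 < +\infty$. We will show by induction that 
for any $m \ge m_0$, $\mathrm{Var}[\mu_m] \le \epsilon$, which implies the statement of 
the theorem. 

Assume that $\mathrm{Var}[\mu_m] \le \epsilon$ for some $m \ge m_0$. Then, using this 
assumption, inequality 14.16, and the fact that $\alpha_m \le \epsilon$, the following 
inequality holds: 

$$\mathrm{Var}[\mu_{m+1}] \le (1 - \alpha_m)\epsilon + \epsilon \alpha_m = \epsilon.$$

Thus, this proves that $\lim_{m \to +\infty} \mathrm{Var}[\mu_m] = 0$, that is the 
$L_2$ convergence of $\mu_m$ to $\mathrm{E}[X]$. 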


Note that the hypotheses of the theorem related to the sequence 
$(\alpha_m)_{m \in \mathbb{N}}$ hold in particular when $\alpha_m = \frac{1}{m}$. The 
special case of the theorem with this choice of $\alpha_m$ coincides with the strong 
law of large numbers. This result has tight connections with the general problem of 
stochastic optimization. 
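
The recursion (14.14) can be illustrated with a minimal Python sketch. The function 
name estimate_mean, the uniform sampling distribution, and the index shift 
$\alpha_m = 1/(m+1)$ (used so that $\alpha_0$ is defined) are assumptions made for 
illustration only. With this step size, the iterate is exactly the running average of 
the samples seen so far. 

```python
import random

def estimate_mean(samples, alpha):
    """Iterate mu_{m+1} = (1 - alpha(m)) * mu_m + alpha(m) * x_m, starting at mu_0 = x_0."""
    mu = samples[0]
    for m, x in enumerate(samples):
        a = alpha(m)                    # step size alpha_m, assumed to lie in [0, 1]
        mu = (1.0 - a) * mu + a * x
    return mu

rng = random.Random(0)
xs = [rng.random() for _ in range(10000)]          # i.i.d. values of X in [0, 1]
mu_hat = estimate_mean(xs, lambda m: 1.0 / (m + 1))
```

Other step sizes satisfying the conditions of the theorem, such as 
$\alpha_m = (m+1)^{-0.75}$, can be passed through the same alpha argument. 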

Stochastic optimization is the general problem of finding the solution to the 
equation 


x = H(x), 


where $x \in \mathbb{R}^N$, when 

- $H(x)$ cannot be computed, for example, because $H$ is not accessible or because 
the cost of its computation is prohibitive; 

- but an i.i.d. sample of $m$ noisy observations $H(x_i) + w_i$ is available, 
$i \in [1, m]$, where the noise random variable $w$ has expectation zero: 
$\mathrm{E}[w] = 0$. 


This problem arises in a variety of different contexts and applications. As we shall 
see, it is directly related to the learning problem for MDPs. 

One general idea for solving this problem is to use an iterative method and define 
a sequence $(x_t)_{t \in \mathbb{N}}$ in a way similar to what is suggested by theorem 
14.5: 

$$x_{t+1} = (1 - \alpha_t) x_t + \alpha_t [H(x_t) + w_t] \tag{14.17}$$
$$\phantom{x_{t+1}} = x_t + \alpha_t [H(x_t) + w_t - x_t], \tag{14.18}$$

where $(\alpha_t)_{t \in \mathbb{N}}$ follow conditions similar to those assumed in 
theorem 14.5. More generally, we consider sequences defined via 

$$x_{t+1} = x_t + \alpha_t D(x_t, w_t), \tag{14.19}$$

where $D$ is a function mapping $\mathbb{R}^N \times \mathbb{R}^N$ to $\mathbb{R}^N$. 
There are many different theorems guaranteeing the convergence of this sequence under 
various assumptions. We will present one of the most general forms of such theorems, 
which relies on the following general result. 


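Theorem 14.6 Supermartingale convergence 

Let $(X_t)_{t \in \mathbb{N}}$, $(Y_t)_{t \in \mathbb{N}}$, and $(Z_t)_{t \in \mathbb{N}}$ 
be sequences of non-negative random variables such that 
$\sum_{t=0}^{+\infty} Y_t < \infty$. Let $\mathcal{F}_t$ denote all the information for 
$t' \le t$: $\mathcal{F}_t = \{(X_{t'})_{t' \le t}, (Y_{t'})_{t' \le t}, (Z_{t'})_{t' \le t}\}$. 
Then, if $\mathrm{E}[X_{t+1} \mid \mathcal{F}_t] \le X_t + Y_t - Z_t$, the following 
holds: 

- $X_t$ converges to a limit (with probability one). 
- $\sum_{t=0}^{+\infty} Z_t < \infty$. 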


The following is one of the most general forms of such theorems. 


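Theorem 14.7 

Let $D$ be a function mapping $\mathbb{R}^N \times \mathbb{R}^N$ to $\mathbb{R}^N$, 
$(x_t)_{t \in \mathbb{N}}$ and $(w_t)_{t \in \mathbb{N}}$ two sequences in 
$\mathbb{R}^N$, and $(\alpha_t)_{t \in \mathbb{N}}$ a sequence of real numbers with 
$x_{t+1} = x_t + \alpha_t D(x_t, w_t)$. Let $\mathcal{F}_t$ denote the entire history 
for $t' \le t$, that is: 
$\mathcal{F}_t = \{(x_{t'})_{t' \le t}, (w_{t'})_{t' \le t}, (\alpha_{t'})_{t' \le t}\}$. 
Let $\Psi$ denote $x \mapsto \frac{1}{2}\|x - x^*\|_2^2$ for some 
$x^* \in \mathbb{R}^N$ and assume that $D$ and $(\alpha_t)_{t \in \mathbb{N}}$ verify the 
following conditions: 

- $\exists K_1, K_2 \in \mathbb{R}\colon \mathrm{E}\big[\|D(x_t, w_t)\|_2^2 \mid \mathcal{F}_t\big] \le K_1 + K_2 \Psi(x_t)$; 

- $\exists c > 0\colon \nabla\Psi(x_t)^\top \mathrm{E}\big[D(x_t, w_t) \mid \mathcal{F}_t\big] \le -c\,\Psi(x_t)$; 

- $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. 

Then, the sequence $x_t$ converges almost surely to $x^*$: 

$$x_t \xrightarrow{\text{a.s.}} x^*. \tag{14.20}$$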


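Proof Since function $\Psi$ is quadratic, a Taylor expansion gives 

$$\Psi(x_{t+1}) = \Psi(x_t) + \nabla\Psi(x_t)^\top (x_{t+1} - x_t) + \frac{1}{2}(x_{t+1} - x_t)^\top \nabla^2\Psi(x_t)(x_{t+1} - x_t).$$

Thus, 

$$\begin{aligned}
\mathrm{E}\big[\Psi(x_{t+1}) \mid \mathcal{F}_t\big] &= \Psi(x_t) + \alpha_t \nabla\Psi(x_t)^\top \mathrm{E}\big[D(x_t, w_t) \mid \mathcal{F}_t\big] + \frac{\alpha_t^2}{2} \mathrm{E}\big[\|D(x_t, w_t)\|^2 \mid \mathcal{F}_t\big] \\
&\le \Psi(x_t) - \alpha_t c\,\Psi(x_t) + \frac{\alpha_t^2}{2}\big(K_1 + K_2 \Psi(x_t)\big) \\
&= \Psi(x_t) + \frac{\alpha_t^2 K_1}{2} - \Big(\alpha_t c - \frac{\alpha_t^2 K_2}{2}\Big)\Psi(x_t).
\end{aligned}$$

Since by assumption the series $\sum_{t=0}^{\infty} \alpha_t^2$ is convergent, 
$(\alpha_t^2)_t$ and thus $(\alpha_t)_t$ converges to zero. Therefore, for $t$ 
sufficiently large, the term $\big(\alpha_t c - \frac{\alpha_t^2 K_2}{2}\big)\Psi(x_t)$ 
has the sign of $\alpha_t c\,\Psi(x_t)$ and is non-negative, since $\alpha_t > 0$, 
$\Psi(x_t) \ge 0$, and $c > 0$. Thus, by the supermartingale convergence theorem 14.6, 
$\Psi(x_t)$ converges and 
$\sum_{t=0}^{\infty} \big(\alpha_t c - \frac{\alpha_t^2 K_2}{2}\big)\Psi(x_t) < \infty$. 
Since $\Psi(x_t)$ converges and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, we have 
$\sum_{t=0}^{\infty} \frac{\alpha_t^2 K_2}{2}\Psi(x_t) < \infty$. But, since 
$\sum_{t=0}^{\infty} \alpha_t = \infty$, if the limit of $\Psi(x_t)$ were non-zero, we 
would have $\sum_{t=0}^{\infty} \alpha_t c\,\Psi(x_t) = \infty$. This implies that the 
limit of $\Psi(x_t)$ is zero, that is $\lim_{t \to \infty} \|x_t - x^*\|_2 = 0$, which 
implies $x_t \xrightarrow{\text{a.s.}} x^*$. ∎ 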


The following is another related result for which we do not present the full proof. 


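Theorem 14.8 
Let $H$ be a function mapping $\mathbb{R}^N$ to $\mathbb{R}^N$, and 
$(x_t)_{t \in \mathbb{N}}$, $(w_t)_{t \in \mathbb{N}}$, and $(\alpha_t)_{t \in \mathbb{N}}$ 
be three sequences in $\mathbb{R}^N$ with 

$$\forall s \in [1, N], \quad x_{t+1}(s) = x_t(s) + \alpha_t(s)\big[H(x_t)(s) - x_t(s) + w_t(s)\big].$$

Let $\mathcal{F}_t$ denote the entire history for $t' \le t$, that is: 
$\mathcal{F}_t = \{(x_{t'})_{t' \le t}, (w_{t'})_{t' \le t}, (\alpha_{t'})_{t' \le t}\}$ 
and assume that the following conditions are met: 

- $\exists K_1, K_2 \in \mathbb{R}\colon \mathrm{E}\big[w_t^2(s) \mid \mathcal{F}_t\big] \le K_1 + K_2 \|x_t\|^2$ for some norm $\|\cdot\|$; 
- $\mathrm{E}\big[w_t \mid \mathcal{F}_t\big] = 0$; 
- $\forall s \in [1, N]$, $\sum_{t=0}^{\infty} \alpha_t(s) = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2(s) < \infty$; and 
- $H$ is a $\|\cdot\|_\infty$-contraction with fixed point $x^*$. 

Then, the sequence $x_t$ converges almost surely to $x^*$: 

$$x_t \xrightarrow{\text{a.s.}} x^*. \tag{14.21}$$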
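
The iteration of theorem 14.8 can be sketched in Python on a toy problem. Everything 
specific in this sketch is an assumption chosen for illustration: the affine map 
$H(x) = \gamma A x + b$ with a row-stochastic matrix $A$ (a $\gamma$-contraction in 
the $\|\cdot\|_\infty$ norm), the Gaussian noise, and the step size 
$\alpha_t = 1/(t+1)$ shared by all coordinates. 

```python
import random

GAMMA = 0.5
A = [[0.5, 0.5], [0.2, 0.8]]     # row-stochastic matrix (hypothetical example)
B = [1.0, 2.0]

def H(x):
    """Affine map H(x) = GAMMA * A x + B, a GAMMA-contraction in the sup norm."""
    return [GAMMA * sum(A[s][j] * x[j] for j in range(len(x))) + B[s]
            for s in range(len(x))]

def stochastic_fixed_point(x0, steps, rng, noise_std=0.5):
    """Iterate x_{t+1} = x_t + alpha_t * (H(x_t) + w_t - x_t) with zero-mean noise w_t."""
    x = list(x0)
    for t in range(steps):
        alpha = 1.0 / (t + 1)
        # Only the noisy observation H(x_t) + w_t is used by the update.
        noisy = [h + rng.gauss(0.0, noise_std) for h in H(x)]
        x = [xi + alpha * (ni - xi) for xi, ni in zip(x, noisy)]
    return x

x_hat = stochastic_fixed_point([0.0, 0.0], 20000, random.Random(0))
# At the fixed point x*, H(x*) - x* = 0, so a small residual indicates convergence.
residual = max(abs(h - xi) for h, xi in zip(H(x_hat), x_hat))
```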


The next sections present several learning algorithms for MDPs with an unknown 
model. 


14.5.2 TD(0) algorithm 


This section presents the TD(0) algorithm for evaluating a policy in the case where 
the environment model is unknown. The algorithm is based on Bellman's linear 
equations giving the value of a policy $\pi$ (see proposition 14.1): 

$$V_\pi(s) = \mathrm{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s') = \mathrm{E}_{s'}\big[r(s, \pi(s)) + \gamma V_\pi(s') \mid s\big].$$

However, here the probability distribution according to which this last expectation 
is defined is not known. Instead, the TD(0) algorithm consists of 

- sampling a new state $s'$; and 
- updating the policy values according to the following, which justifies the name of 
the algorithm: 

$$\begin{aligned}
V(s) &\leftarrow (1 - \alpha)V(s) + \alpha[r(s, \pi(s)) + \gamma V(s')] \\
&= V(s) + \alpha\underbrace{[r(s, \pi(s)) + \gamma V(s') - V(s)]}_{\text{temporal difference of } V \text{ values}}.
\end{aligned} \tag{14.22}$$

Here, the parameter $\alpha$ is a function of the number of visits to the state $s$. 


The pseudocode of the algorithm is given below. The algorithm starts with an 
arbitrary policy value vector $V_0$. An initial state is returned by SELECTSTATE at 
the beginning of each epoch. Within each epoch, the iteration continues until a 
final state is found. Within each iteration, action $\pi(s)$ is taken from the current 
state $s$ following policy $\pi$. The new state $s'$ reached and the reward $r'$ 
received are observed. The policy value of state $s$ is then updated according to the 
rule (14.22) and the current state is set to $s'$. 

The convergence of the algorithm can be proven using theorem 14.8. We will give 
instead the full proof of the convergence of the Q-learning algorithm, for which that 
of TD(0) can be viewed as a special case. 

TD(0)() 
    V ← V_0    ▷ initialization. 
    for t ← 0 to T do 
        s ← SELECTSTATE() 
        for each step of epoch t do 
            r' ← REWARD(s, π(s)) 
            s' ← NEXTSTATE(π, s) 
            V(s) ← (1 - α)V(s) + α[r' + γV(s')] 
            s ← s' 
    return V 
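
The TD(0) pseudocode can be turned into a short Python sketch. The helper names 
(td0, step, select_state, is_final), the three-state episodic MDP, and the choice 
$\alpha = 1/n(s)$, where $n(s)$ counts the visits to $s$, are assumptions introduced 
for illustration. In this toy MDP the policy values are $V_\pi(1) = 2$ and 
$V_\pi(0) = 0.5 \cdot (0.9 \cdot 2) + 0.5 \cdot 1 = 1.4$. 

```python
import random

def td0(policy, step, select_state, is_final, gamma, epochs, rng):
    """Tabular TD(0) policy evaluation with alpha = 1 / (number of visits to s)."""
    V, visits = {}, {}
    for _ in range(epochs):
        s = select_state(rng)
        while not is_final(s):
            r, s_next = step(s, policy(s), rng)      # observe reward r' and state s'
            visits[s] = visits.get(s, 0) + 1
            alpha = 1.0 / visits[s]
            target = r + gamma * V.get(s_next, 0.0)  # unseen or final states count as 0
            V[s] = (1.0 - alpha) * V.get(s, 0.0) + alpha * target   # rule (14.22)
            s = s_next
    return V

def step(s, a, rng):
    """Hypothetical MDP: from 0 go to 1 (reward 0) or to the final state 2 (reward 1)."""
    if s == 0:
        return (0.0, 1) if rng.random() < 0.5 else (1.0, 2)
    return (2.0, 2)                                  # from 1: reward 2, then final state

V = td0(policy=lambda s: "go", step=step, select_state=lambda rng: 0,
        is_final=lambda s: s == 2, gamma=0.9, epochs=5000, rng=random.Random(0))
```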


14.5.3 Q-learning algorithm 


This section presents an algorithm for estimating the optimal state-action value 
function $Q^*$ in the case of an unknown model. Note that the optimal policy or policy 
value can be straightforwardly derived from $Q^*$ via: 
$\pi^*(s) = \operatorname{argmax}_{a \in A} Q^*(s, a)$ and 
$V^*(s) = \max_{a \in A} Q^*(s, a)$. To simplify the presentation, we will assume a 
deterministic reward function. 

The Q-learning algorithm is based on the equations giving the optimal state-action 
value function $Q^*$ (14.4): 

$$\begin{aligned}
Q^*(s, a) &= \mathrm{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \\
&= \mathrm{E}_{s'}\big[r(s, a) + \gamma \max_{a' \in A} Q^*(s', a')\big].
\end{aligned}$$

As for the policy values in the previous section, the distribution model is not known. 
Thus, the Q-learning algorithm consists of the following main steps: 

- sampling a new state $s'$; and 

- updating the policy values according to the following: 

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha\big[r(s, a) + \gamma \max_{a' \in A} Q(s', a')\big], \tag{14.23}$$

where the parameter $\alpha$ is a function of the number of visits to the state $s$. 


The algorithm can be viewed as a stochastic formulation of the value iteration 
algorithm presented in the previous section. The pseudocode is given below. Within 
each epoch, an action is selected from the current state $s$ using a policy $\pi$ 
derived from $Q$. The choice of the policy $\pi$ is arbitrary so long as it guarantees 
that every pair $(s, a)$ is visited infinitely many times. The reward received and the 
state $s'$ observed are then used to update $Q$ following (14.23). 

Q-LEARNING(π) 
 1  Q ← Q_0    ▷ initialization, e.g., Q_0 = 0. 
 2  for t ← 0 to T do 
 3      s ← SELECTSTATE() 
 4      for each step of epoch t do 
 5          a ← SELECTACTION(π, s)    ▷ policy π derived from Q, e.g., ε-greedy. 
 6          r' ← REWARD(s, a) 
 7          s' ← NEXTSTATE(s, a) 
 8          Q(s, a) ← Q(s, a) + α[r' + γ max_{a'} Q(s', a') - Q(s, a)] 
 9          s ← s' 
10  return Q 
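
A tabular version of Q-learning can be sketched in Python as follows. The four-state 
chain environment, the function names, the fixed exploration rate, and the step size 
$\alpha = n(s, a)^{-0.6}$, where $n(s, a)$ counts the visits to the pair $(s, a)$, are 
all assumptions made for this illustration. This step size lies in $[0, 1]$, its sum 
diverges, and the sum of its squares converges. For this chain, 
$Q^*(2, \text{right}) = 1$, $Q^*(1, \text{right}) = 0.9$, and 
$Q^*(0, \text{right}) = 0.81$. 

```python
import random

ACTIONS = (0, 1)             # 0 = left, 1 = right
GOAL = 3                     # final state of a hypothetical 4-state chain

def env_step(s, a):
    """Deterministic chain: reward 1 on reaching GOAL, 0 otherwise."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == GOAL else 0.0), s_next

def epsilon_greedy(Q, s, eps, rng):
    """Random action with probability eps, greedy action with respect to Q otherwise."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_learning(epochs, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    visits = {key: 0 for key in Q}
    for _ in range(epochs):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, eps, rng)
            r, s_next = env_step(s, a)
            visits[(s, a)] += 1
            alpha = visits[(s, a)] ** -0.6
            best_next = max(Q[(s_next, b)] for b in ACTIONS)
            # Update of line 8: the max over a' does not depend on the learning policy.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning(epochs=2000)
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```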


Theorem 14.9 

Consider a finite MDP. Assume that for all $s \in S$ and $a \in A$, 
$\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty$, and 
$\sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty$ with $\alpha_t(s, a) \in [0, 1]$. Then, 
the Q-learning algorithm converges to the optimal value $Q^*$ (with probability one). 


Note that the conditions on $\alpha_t(s, a)$ impose that each state-action pair is 
visited infinitely many times. 


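Proof Let $(Q_t(s, a))_{t \ge 0}$ denote the sequence of state-action value functions 
at $(s, a) \in S \times A$ generated by the algorithm. By definition of the Q-learning 
updates, 

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha\big[r(s_t, a_t) + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\big].$$

This can be rewritten as the following for all $s \in S$ and $a \in A$: 

$$\begin{aligned}
Q_{t+1}(s, a) = Q_t(s, a) &+ \alpha_t(s, a)\Big[r(s, a) + \gamma \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big] - Q_t(s, a)\Big] \\
&+ \gamma\alpha_t(s, a)\Big[\max_{a'} Q_t(s', a') - \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big]\Big],
\end{aligned} \tag{14.24}$$

if we define $\alpha_t(s, a)$ as $0$ if $(s, a) \ne (s_t, a_t)$ and 
$\alpha_t(s_t, a_t)$ otherwise. Now, let $\mathbf{Q}_t$ denote the vector with 
components $Q_t(s, a)$, $\mathbf{w}_t$ the vector whose $s'$th component is 

$$w_t(s') = \max_{a'} Q_t(s', a') - \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big],$$

and $H(\mathbf{Q}_t)$ the vector with components $H(\mathbf{Q}_t)(s, a)$ defined by 

$$H(\mathbf{Q}_t)(s, a) = r(s, a) + \gamma \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big].$$

Then, in view of (14.24), 

$$\forall (s, a) \in S \times A, \quad Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t(s, a)\big[H(\mathbf{Q}_t)(s, a) - Q_t(s, a) + \gamma w_t(s)\big].$$

We now show that the hypotheses of theorem 14.8 hold for $\mathbf{Q}_t$ and 
$\mathbf{w}_t$, which will imply the convergence of $\mathbf{Q}_t$ to $\mathbf{Q}^*$. 
The conditions on $\alpha_t$ hold by assumption. By definition of $\mathbf{w}_t$, 
$\mathrm{E}[\mathbf{w}_t \mid \mathcal{F}_t] = 0$. Also, for any $s' \in S$, 

$$\begin{aligned}
|w_t(s')| &\le \max_{a'} |Q_t(s', a')| + \Big|\mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big]\Big| \\
&\le 2 \max_{s'} \Big|\max_{a'} Q_t(s', a')\Big| \le 2\|\mathbf{Q}_t\|_\infty.
\end{aligned}$$

Thus, $\mathrm{E}\big[w_t^2(s) \mid \mathcal{F}_t\big] \le 4\|\mathbf{Q}_t\|_\infty^2$. 
Finally, $H$ is a $\gamma$-contraction for $\|\cdot\|_\infty$ since for any 
$\mathbf{Q}_1', \mathbf{Q}_2' \in \mathbb{R}^{|S| \times |A|}$, and 
$(s, a) \in S \times A$, we can write 

$$\begin{aligned}
|H(\mathbf{Q}_2')(s, a) - H(\mathbf{Q}_1')(s, a)| &= \Big|\gamma \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_2'(s', a') - \max_{a'} Q_1'(s', a')\big]\Big| \\
&\le \gamma \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]}\Big[\big|\max_{a'} Q_2'(s', a') - \max_{a'} Q_1'(s', a')\big|\Big] \\
&\le \gamma \mathop{\mathrm{E}}_{s' \sim \Pr[\cdot \mid s, a]} \max_{a'} \big[|Q_2'(s', a') - Q_1'(s', a')|\big] \\
&\le \gamma \max_{s'} \max_{a'} \big[|Q_2'(s', a') - Q_1'(s', a')|\big] \\
&= \gamma \|\mathbf{Q}_2' - \mathbf{Q}_1'\|_\infty.
\end{aligned}$$

Since $H$ is a contraction, it admits a fixed point $\mathbf{Q}^*$: 
$H(\mathbf{Q}^*) = \mathbf{Q}^*$. ∎ 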


The choice of the policy $\pi$ according to which an action $a$ is selected (line 5) is 
not specified by the algorithm and, as already indicated, the theorem guarantees the 
convergence of the algorithm for an arbitrary policy so long as it ensures that every 
pair $(s, a)$ is visited infinitely many times. In practice, several natural choices 
are considered for $\pi$. One possible choice is the policy determined by the 
state-action value at time $t$, $Q_t$. Thus, the action selected from state $s$ is 
$\operatorname{argmax}_{a \in A} Q_t(s, a)$. But this choice typically does not 
guarantee that all actions are taken or that all states are visited. Instead, a 
standard choice in reinforcement learning is the so-called $\epsilon$-greedy policy, 
which consists of selecting with probability $(1 - \epsilon)$ the greedy action from 
state $s$, that is, $\operatorname{argmax}_{a \in A} Q_t(s, a)$, and with probability 
$\epsilon$ a random action from $s$, for some $\epsilon \in (0, 1)$. Another possible 
choice is the so-called Boltzmann exploration, which, given the current state-action 
value $Q$, epoch $t \in [0, T]$, and current state $s$, consists of selecting action 
$a$ with the following probability: 

$$p_t(a \mid s, Q) = \frac{e^{Q(s, a)/\tau_t}}{\sum_{a' \in A} e^{Q(s, a')/\tau_t}},$$

where $\tau_t$ is the temperature. $\tau_t$ must be defined so that $\tau_t \to 0$ as 
$t \to \infty$, which ensures that for large values of $t$, the greedy action based on 
$Q$ is selected. This is natural, since as $t$ increases, we can expect $Q$ to be close 
to the optimal function. On the other hand, $\tau_t$ must be chosen so that it does not 
tend to $0$ too fast to ensure that all actions are visited infinitely often. It can be 
chosen, for instance, as $1/\log(n_t(s))$, where $n_t(s)$ is the number of times $s$ 
has been visited up to epoch $t$. 
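
The two exploration schemes just described can be sketched in Python as action 
distributions over a list of state-action values for a fixed state. The function names 
and the example values are hypothetical. Subtracting the maximum before exponentiating 
is a standard numerical safeguard and does not change the Boltzmann probabilities. 

```python
import math
import random

def boltzmann_probabilities(q_values, temperature):
    """Return p(a) proportional to exp(Q(s, a) / temperature) for each action a."""
    m = max(q_values)                        # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def epsilon_greedy_probabilities(q_values, eps):
    """Greedy action with probability 1 - eps, uniformly random action with probability eps."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    return [eps / n + (1.0 - eps if a == greedy else 0.0) for a in range(n)]

def sample_action(probabilities, rng):
    """Draw an action index according to the given distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

q = [1.0, 2.0, 0.5]                          # hypothetical values Q(s, a) for three actions
hot = boltzmann_probabilities(q, temperature=10.0)    # close to uniform
cold = boltzmann_probabilities(q, temperature=0.01)   # close to greedy
eg = epsilon_greedy_probabilities(q, eps=0.1)
a = sample_action(cold, random.Random(0))
```

A high temperature spreads the probability mass almost uniformly, while a low 
temperature concentrates it on the greedy action, which matches the role of 
$\tau_t \to 0$ in the text. 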

Reinforcement learning algorithms include two components: a learning policy, 
which determines the action to take, and an update rule, which defines the new 
estimate of the optimal value function. For an off-policy algorithm, the update 
rule does not necessarily depend on the learning policy. Q-learning is an off-policy 
algorithm since its update rule (line 8 of the pseudocode) is based on the max 
operator and the comparison of all possible actions $a'$; thus, it does not depend on 
the policy $\pi$. In contrast, the algorithm presented in the next section, SARSA, is 
an on-policy algorithm. 


14.5.4 SARSA 


SARSA is also an algorithm for estimating the optimal state-action value function in 
the case of an unknown model. The pseudocode is given in figure 14.7. The algorithm 
is in fact very similar to Q-learning, except that its update rule (line 9 of the 
pseudocode) is based on the action $a'$ selected by the learning policy. Thus, SARSA 
is an on-policy algorithm, and its convergence therefore crucially depends on the 
learning policy. In particular, the convergence of the algorithm requires, in addition 
to all actions being selected infinitely often, that the learning policy becomes greedy 
in the limit. The proof of the convergence of the algorithm is nevertheless close to 
that of Q-learning. 

The name of the algorithm derives from the sequence of instructions defining 
successively $s$, $a$, $r'$, $s'$, and $a'$, and the fact that the update to the 
function $Q$ depends on the quintuple $(s, a, r', s', a')$. 

SARSA(π) 
 1  Q ← Q_0    ▷ initialization, e.g., Q_0 = 0. 
 2  for t ← 0 to T do 
 3      s ← SELECTSTATE() 
 4      a ← SELECTACTION(π(Q), s)    ▷ policy π derived from Q, e.g., ε-greedy. 
 5      for each step of epoch t do 
 6          r' ← REWARD(s, a) 
 7          s' ← NEXTSTATE(s, a) 
 8          a' ← SELECTACTION(π(Q), s')    ▷ policy π derived from Q, e.g., ε-greedy. 
 9          Q(s, a) ← Q(s, a) + α_t(s, a)[r' + γQ(s', a') - Q(s, a)] 
10          s ← s' 
11          a ← a' 
12  return Q 


Figure 14.7. The SARSA algorithm. 
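
For comparison with Q-learning, the SARSA update can be sketched in Python on a small 
example. The four-state chain, the function names, the step size 
$\alpha_t(s, a) = n(s, a)^{-0.6}$, and the exploration schedule 
$\epsilon_t = 1/(t+1)$ (one simple way to make the learning policy greedy in the limit) 
are assumptions for illustration only. 

```python
import random

ACTIONS = (0, 1)             # 0 = left, 1 = right
GOAL = 3                     # final state of a hypothetical 4-state chain

def env_step(s, a):
    """Deterministic chain: reward 1 on reaching GOAL, 0 otherwise."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == GOAL else 0.0), s_next

def select_action(Q, s, eps, rng):
    """Epsilon-greedy learning policy derived from Q."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(epochs, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    visits = {key: 0 for key in Q}
    for t in range(epochs):
        eps = 1.0 / (t + 1)                  # exploration vanishes as t grows
        s = 0
        a = select_action(Q, s, eps, rng)
        while s != GOAL:
            r, s_next = env_step(s, a)
            a_next = select_action(Q, s_next, eps, rng)
            visits[(s, a)] += 1
            alpha = visits[(s, a)] ** -0.6
            # Update of line 9: uses the action a' actually chosen by the learning policy.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Q = sarsa(epochs=2000)
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

The only difference from the Q-learning update is the term $Q(s', a')$, which replaces 
$\max_{a'} Q(s', a')$. 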


14.5.5 TD(λ) algorithm 


Both TD(0) and Q-learning algorithms are only based on immediate rewards. The 
idea of TD(λ) consists instead of using multiple steps ahead. Thus, for $n > 1$ steps, 
we would have the update 

$$V(s) \leftarrow V(s) + \alpha(R_t^n - V(s)),$$

where $R_t^n$ is defined by 

$$R_t^n = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$$

How should $n$ be chosen? Instead of selecting a specific $n$, TD(λ) is based on a 
geometric distribution over all rewards $R_t^n$, that is, it uses 
$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^n$ instead of 
$R_t^n$, where $\lambda \in [0, 1]$. Thus, the main update becomes 

$$V(s) \leftarrow V(s) + \alpha(R_t^\lambda - V(s)).$$
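
The $n$-step and geometrically weighted returns can be computed explicitly for a finite 
episode, as in the following Python sketch. It assumes the weighting 
$(1 - \lambda)\lambda^{n-1}$ over $n \ge 1$, an episode that ends in a final state of 
value zero, and hypothetical reward and value lists in which rewards[k] is the reward 
$r_{k+1}$ and values[k] is $V(s_k)$. 

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^n = r_{t+1} + gamma r_{t+2} + ... + gamma^(n-1) r_{t+n} + gamma^n V(s_{t+n})."""
    horizon = min(n, len(rewards) - t)       # no rewards are collected after the final state
    total = sum(gamma ** i * rewards[t + i] for i in range(horizon))
    return total + gamma ** horizon * values[t + horizon]

def lambda_return(rewards, values, t, gamma, lam):
    """Weighted average of the n-step returns with weights (1 - lam) * lam^(n - 1)."""
    remaining = len(rewards) - t             # number of steps left until the final state
    total = sum((1.0 - lam) * lam ** (n - 1)
                * n_step_return(rewards, values, t, n, gamma)
                for n in range(1, remaining))
    # Every n >= remaining yields the same return; those weights sum to lam^(remaining - 1).
    tail = lam ** (remaining - 1)
    return total + tail * n_step_return(rewards, values, t, remaining, gamma)

rewards = [0.0, 1.0, 2.0]                    # r_1, r_2, r_3
values = [0.5, 1.5, 2.5, 0.0]                # V(s_0), ..., V(s_3), with s_3 final
one_step = lambda_return(rewards, values, 0, gamma=0.9, lam=0.0)
monte_carlo = lambda_return(rewards, values, 0, gamma=0.9, lam=1.0)
```

With $\lambda = 0$ only the one-step return $r_1 + \gamma V(s_1)$ remains, and with 
$\lambda = 1$ the result is the total discounted reward of the episode. 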


The pseudocode of the algorithm is given below. For $\lambda = 0$, the algorithm 
coincides with TD(0). $\lambda = 1$ corresponds to the total future reward. 

TD(λ)() 
    V ← V_0    ▷ initialization. 
    e ← 0 
    for t ← 0 to T do 
        s ← SELECTSTATE() 
        for each step of epoch t do 
            s' ← NEXTSTATE(π, s) 
            δ ← r(s, π(s)) + γV(s') - V(s) 
            e(s) ← λe(s) + 1 
            for u ∈ S do 
                if u ≠ s then 
                    e(u) ← γλe(u) 
                V(u) ← V(u) + αδe(u) 
            s ← s' 
    return V 


In the previous sections, we presented learning algorithms for an agent navigating 
in an unknown environment. The scenario faced in many practical applications is 
more challenging; often, the information the agent receives about the environment 
is uncertain or unreliable. Such problems can be modeled as partially observable 
Markov decision processes (POMDPs). POMDPs are defined by augmenting the 
definition of MDPs with an observation probability distribution depending on the 
action taken, the state reached, and the observation. The presentation of their model 
and solution techniques is beyond the scope of this material. 


14.5.6 Large state space 


In some cases in practice, the number of states or actions to consider for the 
environment may be very large. For example, the number of states in the game 
of backgammon is estimated to be over $10^{20}$. Thus, the algorithms presented in 
the previous section can become computationally impractical for such applications. 
More importantly, generalization becomes extremely difficult. 

Suppose we wish to estimate the policy value $V_\pi(s)$ at each state $s$ using 
experience obtained using policy $\pi$. To cope with the case of large state spaces, 
we can map each state of the environment to $\mathbb{R}^N$ via a mapping 
$\Phi\colon S \to \mathbb{R}^N$, with $N$ relatively small ($N \approx 200$ has been 
used for backgammon) and approximate $V_\pi(s)$ by a function $f_w(s)$ parameterized by 
some vector $w$. For example, $f_w$ could be a linear function defined by 
$f_w(s) = w \cdot \Phi(s)$ for all $s \in S$, or some more complex non-linear function 
of $w$. The problem then consists of approximating $V_\pi$ with $f_w$ and can be 
formulated as a regression problem. Note, however, that the empirical data available is 
not i.i.d. 

Suppose that at each time step $t$ the agent receives the exact policy value 
$V_\pi(s_t)$. Then, if the family of functions $f_w$ is differentiable, a gradient 
descent method applied to the empirical squared loss can be used to sequentially update 
the weight vector $w$ via: 

$$w_{t+1} = w_t - \alpha \nabla_{w_t} \frac{1}{2}\big[V_\pi(s_t) - f_{w_t}(s_t)\big]^2 = w_t + \alpha\big[V_\pi(s_t) - f_{w_t}(s_t)\big] \nabla_{w_t} f_{w_t}(s_t).$$


It is worth mentioning, however, that for large action spaces, there are simple cases 
where the methods used do not converge and instead cycle. 
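
For the linear case $f_w(s) = w \cdot \Phi(s)$, the gradient is 
$\nabla_w f_w(s) = \Phi(s)$ and the weight update above takes a particularly simple 
form. The following Python sketch illustrates it under strong simplifying assumptions: 
three hypothetical feature vectors, exact targets generated by a known weight vector 
w_true, states visited cyclically, and a constant step size. 

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def gradient_step(w, phi_s, target, alpha):
    """One update w <- w + alpha * (target - w . Phi(s)) * Phi(s) for linear f_w."""
    error = target - dot(w, phi_s)
    return [wi + alpha * error * fi for wi, fi in zip(w, phi_s)]

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical Phi(s) for three states
w_true = [2.0, -1.0]                              # generates the exact values V_pi(s)
w = [0.0, 0.0]
for t in range(3000):
    phi_s = features[t % len(features)]           # state visited at time t
    w = gradient_step(w, phi_s, dot(w_true, phi_s), alpha=0.1)
```

In this noiseless setting the weights approach w_true. With sampled targets, as in 
TD-style methods, convergence is not guaranteed in general. 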


14.6 Chapter notes 


Reinforcement learning is an important area of machine learning with a large body 
of literature. This chapter presents only a brief introduction to this area. For a 
more detailed study, the reader could consult the book of Sutton and Barto [1998], 
whose mathematical content is short, or those of Puterman [1994] and Bertsekas 
[1987], which discuss in more depth several aspects, as well as the more recent book 
of Szepesvari [2010]. The Ph.D. theses of Singh [1993] and Littman [1996] are also 
excellent sources. 

Some foundational work on MDPs and the introduction of the temporal difference 
(TD) methods are due to Sutton [1984]. Q-learning was introduced and analyzed 
by Watkins [1989], though it can be viewed as a special instance of TD methods. 
The first proof of the convergence of Q-learning was given by Watkins and Dayan 
[1992]. 

Many of the techniques used in reinforcement learning are closely related to those 
of stochastic approximation which originated with the work of Robbins and Monro 
[1951], followed by a series of results including Dvoretzky [1956], Schmetterer [1960], 
Kiefer and Wolfowitz [1952], and Kushner and Clark [1978]. For a recent survey of 
stochastic approximation, including a discussion of powerful proof techniques based 
on ODE (ordinary differential equations), see Kushner [2010] and the references 
therein. The connection with stochastic approximation was emphasized by Tsitsiklis 
[1994] and Jaakkola et al. [1994], who gave a related proof of the convergence of 
Q-learning. For the convergence rate of Q-learning, consult Even-Dar and Mansour 
[2003]. For recent results on the convergence of the policy iteration algorithm, see Ye 
[2011], which shows that the algorithm is strongly polynomial for a fixed discount 
factor. 

Reinforcement learning has been successfully applied to a variety of problems 
including robot control, board games such as backgammon in which Tesauro’s TD- 
Gammon reached the level of a strong master [Tesauro, 1995] (see also chapter 
11 of Sutton and Barto [1998]), chess, elevator scheduling problems [Crites and 
Barto, 1996], telecommunications, inventory management, dynamic radio channel 
assignment [Singh and Bertsekas, 1997], and a number of other problems (see 
chapter 1 of Puterman [1994]). 


Conclusion 


We described a large variety of machine learning algorithms and techniques and 
discussed their theoretical foundations as well as their use and applications. While 
this is not a fully comprehensive presentation, it should nevertheless offer the reader 
some idea of the breadth of the field and its multiple connections with a variety of 
other domains, including statistics, information theory, optimization, game theory, 
and automata and formal language theory. 

The fundamental concepts, algorithms, and proof techniques we presented should 
supply the reader with the necessary tools for analyzing other learning algorithms, 
including variants of the algorithms analyzed in this book. They are also likely to 
be helpful for devising new algorithms or for studying new learning schemes. We 
strongly encourage the reader to explore both and more generally to seek enhanced 
solutions for all theoretical, algorithmic, and applied learning problems. 

The exercises included at the end of each chapter, as well as the full solutions we 
provide separately, should help the reader become more familiar with the techniques 
and concepts described. Some of them could also serve as a starting point for 
research work and the investigation of new questions. 

Many of the algorithms we presented as well as their variants can be directly 
used in applications to derive effective solutions to real-world learning problems. 
Our detailed description of the algorithms and discussion should help with their 
implementation or their adaptation to other learning scenarios. 

Machine learning is a relatively recent field and yet probably one of the most 
active ones in computer science. Given the wide accessibility of digitized data and 
its many applications, we can expect it to continue to grow at a very fast pace 
over the next few decades. Learning problems of different nature, some arising 
due to the substantial increase of the scale of the data, which already requires 
processing billions of records in some applications, others related to the introduction 
of completely new learning frameworks, are likely to pose new research challenges 
and require novel algorithmic solutions. In all cases, learning theory, algorithms, 
and applications form an exciting area of computer science and mathematics, which 
we hope this book could at least partly communicate. 


Appendix A Linear Algebra Review 


In this appendix, we introduce some basic notions of linear algebra relevant to the 
material presented in this book. This appendix does not represent an exhaustive 
tutorial, and it is assumed that the reader has some prior knowledge of the subject. 


A.1 Vectors and norms 
We will denote by H a vector space whose dimension may be infinite. 
A.1.1 Norms 


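Definition A.1 

A mapping $\Phi\colon H \to \mathbb{R}_+$ is said to define a norm on $H$ if it 
verifies the following axioms: 

- definiteness: $\forall x \in H$, $\Phi(x) = 0 \Leftrightarrow x = 0$; 

- homogeneity: $\forall x \in H$, $\forall \alpha \in \mathbb{R}$, $\Phi(\alpha x) = |\alpha|\Phi(x)$; 

- triangle inequality: $\forall x, y \in H$, $\Phi(x + y) \le \Phi(x) + \Phi(y)$. 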

A norm is typically denoted by $\|\cdot\|$. Examples of vector norms are the absolute 
value on $\mathbb{R}$ and the Euclidean (or $L_2$) norm on $\mathbb{R}^N$. More 
generally, for any $p \ge 1$ the $L_p$ norm is defined on $\mathbb{R}^N$ as 

$$\forall x \in \mathbb{R}^N, \quad \|x\|_p = \Big(\sum_{j=1}^{N} |x_j|^p\Big)^{1/p}. \tag{A.1}$$

The $L_1$, $L_2$, and $L_\infty$ norms are some of the most commonly used norms, where 
$\|x\|_\infty = \max_{j \in [1, N]} |x_j|$. Two norms $\|\cdot\|$ and $\|\cdot\|'$ are 
said to be equivalent iff there exists $\alpha, \beta > 0$ such that for all 
$x \in H$, 

$$\alpha\|x\| \le \|x\|' \le \beta\|x\|. \tag{A.2}$$

The following general inequalities relating these norms can be proven 
straightforwardly: 

$$\|x\|_2 \le \|x\|_1 \le \sqrt{N}\,\|x\|_2 \tag{A.3}$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{N}\,\|x\|_\infty \tag{A.4}$$
$$\|x\|_\infty \le \|x\|_1 \le N\|x\|_\infty. \tag{A.5}$$

The second inequality of the first line can be shown using the Cauchy-Schwarz 
inequality presented later while the other inequalities are clear. These inequalities 
show the equivalence of these three norms. More generally, all norms on a 
finite-dimensional space are equivalent. The following additional properties hold for 
the $L_\infty$ norm: for all $x \in H$, 

$$\forall p \ge 1, \quad \|x\|_\infty \le \|x\|_p \le N^{1/p}\,\|x\|_\infty \tag{A.6}$$
$$\lim_{p \to +\infty} \|x\|_p = \|x\|_\infty. \tag{A.7}$$

The inequalities of the first line are straightforward and imply the limit property of 
the second line. 
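
Inequalities (A.3) to (A.6) can be checked numerically on a random vector, as in the 
following Python sketch. The helper lp_norm and the test vector are introduced for 
illustration only. A numerical check is of course not a proof. 

```python
import math
import random

def lp_norm(x, p):
    """L_p norm of x for p >= 1; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
N = len(x)
l1, l2, linf = lp_norm(x, 1), lp_norm(x, 2), lp_norm(x, math.inf)
l100 = lp_norm(x, 100)       # for large p, close to the L-infinity norm as in (A.7)
```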

We will often consider a Hilbert space, that is a vector space equipped with an 
inner product $\langle \cdot, \cdot \rangle$ and that is complete (all Cauchy sequences 
are convergent). The inner product induces a norm defined as follows: 

$$\forall x \in H, \quad \|x\|_H = \sqrt{\langle x, x \rangle}. \tag{A.8}$$

A.1.2 Dual norms 


Definition A.2 

Let $\|\cdot\|$ be a norm on $\mathbb{R}^N$. Then, the dual norm $\|\cdot\|_*$ 
associated to $\|\cdot\|$ is the norm defined by 

$$\forall y \in H, \quad \|y\|_* = \sup_{\|x\| = 1} |\langle y, x \rangle|. \tag{A.9}$$

For any $p, q \ge 1$ that are conjugate, that is such that 
$\frac{1}{p} + \frac{1}{q} = 1$, the $L_p$ and $L_q$ norms are dual norms of each 
other. In particular, the dual norm of $L_2$ is the $L_2$ norm, and the dual norm of 
the $L_1$ norm is the $L_\infty$ norm. 


Proposition A.1 Hölder's inequality 
Let $p, q \ge 1$ be conjugate: $\frac{1}{p} + \frac{1}{q} = 1$. Then, for all 
$x, y \in \mathbb{R}^N$, 

$$|\langle x, y \rangle| \le \|x\|_p \|y\|_q, \tag{A.10}$$

with equality when $|y_i| = |x_i|^{p-1}$ for all $i \in [1, N]$. 

Proof The statement holds trivially for $x = 0$ or $y = 0$; thus, we can assume 
$x \ne 0$ and $y \ne 0$. Let $a, b > 0$. By the concavity of $\log$ (see definition 
B.5), we can write 

$$\log\Big(\frac{1}{p} a^p + \frac{1}{q} b^q\Big) \ge \frac{1}{p}\log(a^p) + \frac{1}{q}\log(b^q) = \log(a) + \log(b) = \log(ab).$$

Taking the exponential of the left- and right-hand sides gives 

$$\frac{1}{p} a^p + \frac{1}{q} b^q \ge ab,$$

which is known as Young's inequality. Using this inequality with 
$a = |x_j| / \|x\|_p$ and $b = |y_j| / \|y\|_q$ for $j \in [1, N]$ and summing up gives 

$$\frac{\sum_{j=1}^{N} |x_j y_j|}{\|x\|_p \|y\|_q} \le \frac{1}{p}\frac{\|x\|_p^p}{\|x\|_p^p} + \frac{1}{q}\frac{\|y\|_q^q}{\|y\|_q^q} = \frac{1}{p} + \frac{1}{q} = 1.$$

Since $|\langle x, y \rangle| \le \sum_{j=1}^{N} |x_j y_j|$, the inequality claim 
follows. The equality case can be verified straightforwardly. ∎ 
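
As a numerical illustration of (A.10) and of its equality case, the following Python 
sketch evaluates the gap $\|x\|_p \|y\|_q - |\langle x, y \rangle|$ on random vectors. 
The helper names and test vectors are hypothetical, and the equality case is built 
with matching signs, $y_i = \operatorname{sign}(x_i)|x_i|^{p-1}$. 

```python
import random

def lp_norm(x, p):
    """L_p norm of x for finite p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def holder_gap(x, y, p):
    """Return ||x||_p * ||y||_q - |<x, y>|, where q is the conjugate of p (p > 1)."""
    q = p / (p - 1.0)
    inner = sum(a * b for a, b in zip(x, y))
    return lp_norm(x, p) * lp_norm(y, q) - abs(inner)

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(6)]
y = [rng.uniform(-1.0, 1.0) for _ in range(6)]
gaps = [holder_gap(x, y, p) for p in (1.5, 2.0, 3.0)]   # non-negative by (A.10)

p = 3.0
y_eq = [(1.0 if a >= 0 else -1.0) * abs(a) ** (p - 1.0) for a in x]
gap_eq = holder_gap(x, y_eq, p)                         # equality case: gap is zero
```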


Taking $p = q = 2$ immediately yields the following result known as the 
Cauchy-Schwarz inequality. 


Corollary A.1 Cauchy-Schwarz inequality 
For all $x, y \in \mathbb{R}^N$, 

$$|\langle x, y \rangle| \le \|x\|_2 \|y\|_2, \tag{A.11}$$

with equality iff $x$ and $y$ are collinear. 


Let $\mathcal{H}$ be the hyperplane in $\mathbb{R}^N$ whose equation is given by 

$$w \cdot x + b = 0,$$

for some normal vector $w \in \mathbb{R}^N$ and offset $b \in \mathbb{R}$. Let 
$d_p(x, \mathcal{H})$ denote the distance of $x$ to the hyperplane $\mathcal{H}$, that 
is, 

$$d_p(x, \mathcal{H}) = \inf_{x' \in \mathcal{H}} \|x' - x\|_p. \tag{A.12}$$

Then, the following identity holds for all $p \ge 1$: 

$$d_p(x, \mathcal{H}) = \frac{|w \cdot x + b|}{\|w\|_q}, \tag{A.13}$$

where $q$ is the conjugate of $p$: $\frac{1}{p} + \frac{1}{q} = 1$. (A.13) can be shown 
by a straightforward application of the results of appendix B to the constrained 
optimization problem (A.12). 


A.2 Matrices 


For a matrix $M \in \mathbb{R}^{m \times n}$ with $m$ rows and $n$ columns, we denote 
by $M_{ij}$ its $ij$th entry, for all $i \in [1, m]$ and $j \in [1, n]$. For any 
$m \ge 1$, we denote by $I_m$ the $m$-dimensional identity matrix, and refer to it as 
$I$ when the dimension is clear from the context. 

The transpose of $M$ is denoted by $M^\top$ and defined by 
$(M^\top)_{ij} = M_{ji}$ for all $(i, j)$. For any two matrices 
$M \in \mathbb{R}^{m \times n}$ and $N \in \mathbb{R}^{n \times p}$, 
$(MN)^\top = N^\top M^\top$. $M$ is said to be symmetric iff $M_{ij} = M_{ji}$ for all 
$(i, j)$, that is, iff $M = M^\top$. 

The trace of a square matrix $M$ is denoted by $\mathrm{Tr}[M]$ and defined as 
$\mathrm{Tr}[M] = \sum_{i} M_{ii}$, the sum of its diagonal entries. For any two 
matrices $M \in \mathbb{R}^{m \times n}$ and $N \in \mathbb{R}^{n \times m}$, the 
following identity holds: $\mathrm{Tr}[MN] = \mathrm{Tr}[NM]$. More generally, the 
following cyclic property holds with the appropriate dimensions for the matrices $M$, 
$N$, and $P$: 

$$\mathrm{Tr}[MNP] = \mathrm{Tr}[PMN] = \mathrm{Tr}[NPM]. \tag{A.14}$$

The inverse of a square matrix $M$, which exists when $M$ has full rank, is denoted 
by $M^{-1}$ and is the unique matrix satisfying $MM^{-1} = M^{-1}M = I$. 


A.2.1 Matrix norms 


A matrix norm is a norm defined over $\mathbb{R}^{m \times n}$ where $m$ and $n$ are 
the dimensions of the matrices considered. Many matrix norms, including those discussed 
below, satisfy the following submultiplicative property: 

$$\|MN\| \le \|M\|\,\|N\|. \tag{A.15}$$

The matrix norm induced by the vector norm $\|\cdot\|_p$, or the operator norm induced 
by that norm, is also denoted by $\|\cdot\|_p$ and defined by 

$$\|M\|_p = \sup_{\|x\|_p \le 1} \|Mx\|_p. \tag{A.16}$$


The norm induced for $p = 2$ is known as the spectral norm, which equals the largest 
singular value of $M$ (see section A.2.2), or the square-root of the largest eigenvalue 
of $M^\top M$: 

$$\|M\|_2 = \sigma_1(M) = \sqrt{\lambda_{\max}(M^\top M)}. \tag{A.17}$$

Not all matrix norms are induced by vector norms. The Frobenius norm denoted by 
$\|\cdot\|_F$ is the most notable of such norms and is defined by: 

$$\|M\|_F = \Big(\sum_{i=1}^{m} \sum_{j=1}^{n} M_{ij}^2\Big)^{1/2}.$$

The Frobenius norm can be interpreted as the $L_2$ norm of a vector when treating 
$M$ as a vector of size $mn$. It also coincides with the norm induced by the Frobenius 
product, which is the inner product defined for all 
$M, N \in \mathbb{R}^{m \times n}$ by 

$$\langle M, N \rangle_F = \mathrm{Tr}[M^\top N]. \tag{A.18}$$

This relates the Frobenius norm to the singular values of $M$: 

$$\|M\|_F^2 = \mathrm{Tr}[M^\top M] = \sum_{i=1}^{r} \sigma_i(M)^2,$$

where $r = \mathrm{rank}(M)$. The second equality follows from properties of SPSD 
matrices (see section A.2.3). 

For any $j \in [1, n]$, let $M_j$ denote the $j$th column of $M$, that is 
$M = [M_1 \cdots M_n]$. Then, for any $p, r \ge 1$, the $L_{p,r}$ group norm of $M$ is 
defined by 

$$\|M\|_{p,r} = \Big(\sum_{j=1}^{n} \|M_j\|_p^r\Big)^{1/r}.$$

One of the most commonly used group norms is the $L_{2,1}$ norm defined by 

$$\|M\|_{2,1} = \sum_{i=1}^{n} \|M_i\|_2.$$


A.2.2 Singular value decomposition 


The compact singular value decomposition (SVD) of $M$, with 
$r = \mathrm{rank}(M) \le \min(m, n)$, can be written as follows: 

$$M = U_M \Sigma_M V_M^\top.$$

The $r \times r$ matrix $\Sigma_M = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is 
diagonal and contains the non-zero singular values of $M$ sorted in decreasing order, 
that is $\sigma_1 \ge \ldots \ge \sigma_r > 0$. $U_M \in \mathbb{R}^{m \times r}$ and 
$V_M \in \mathbb{R}^{n \times r}$ have orthonormal columns that contain the left and 
right singular vectors of $M$ corresponding to the sorted singular values. 
$U_k \in \mathbb{R}^{m \times k}$ are the top $k \le r$ left singular vectors of $M$. 

The orthogonal projection onto the span of $U_k$ can be written as 
$P_{U_k} = U_k U_k^\top$, where $P_{U_k}$ is SPSD and idempotent, i.e., 
$P_{U_k}^2 = P_{U_k}$. Moreover, the orthogonal projection onto the subspace orthogonal 
to $U_k$ is defined as $P_{U_k, \perp}$. Similar definitions, i.e., $V_k$, $P_{V_k}$, 
$P_{V_k, \perp}$, hold for the right singular vectors. 

The generalized inverse, or Moore-Penrose pseudo-inverse of a matrix $M$ is 
denoted by $M^\dagger$ and defined by 

$$M^\dagger = V_M \Sigma_M^\dagger U_M^\top, \tag{A.19}$$

where $\Sigma_M^\dagger = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1})$. For any 
square $m \times m$ matrix $M$ with full rank, i.e., $r = m$, the pseudo-inverse 
coincides with the matrix inverse: $M^\dagger = M^{-1}$. 


A.2.3. Symmetric positive semidefinite (SPSD) matrices 


Definition A.3 

A symmetric matrix $M \in \mathbb{R}^{m \times m}$ is said to be positive semidefinite 
iff 

$$x^\top M x \ge 0 \tag{A.20}$$

for all $x \in \mathbb{R}^m$. $M$ is said to be positive definite if the inequality is 
strict for all $x \ne 0$. 


Kernel matrices (see chapter 5) and orthogonal projection matrices are two examples 
of SPSD matrices. It is straightforward to show that a matrix M is SPSD iff its 
eigenvalues are all non-negative. Furthermore, the following properties hold for any 
SPSD matrix M: 


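- $\mathbf{M}$ admits a decomposition $\mathbf{M} = \mathbf{X}^\top \mathbf{X}$ for some matrix $\mathbf{X}$ and the Cholesky decomposition provides one such decomposition in which $\mathbf{X}$ is an upper triangular matrix.

- The left and right singular vectors of $\mathbf{M}$ are the same and the SVD of $\mathbf{M}$ is also its eigenvalue decomposition.

- The SVD of an arbitrary matrix $\mathbf{X} = \mathbf{U}_X \boldsymbol{\Sigma}_X \mathbf{V}_X^\top$ defines the SVD of two related SPSD matrices: the left singular vectors ($\mathbf{U}_X$) are the left singular vectors of $\mathbf{X} \mathbf{X}^\top$, the right singular vectors ($\mathbf{V}_X$) are the right singular vectors of $\mathbf{X}^\top \mathbf{X}$ and the non-zero singular values of $\mathbf{X}$ are the square roots of the non-zero singular values of $\mathbf{X} \mathbf{X}^\top$ and $\mathbf{X}^\top \mathbf{X}$.

- The trace of $\mathbf{M}$ is the sum of its singular values, i.e., $\mathrm{Tr}[\mathbf{M}] = \sum_{i=1}^{r} \sigma_i(\mathbf{M})$, where $\mathrm{rank}(\mathbf{M}) = r$.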
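- The top singular vector of $\mathbf{M}$, $\mathbf{u}_1$, maximizes the Rayleigh quotient, which is defined as

$$r(\mathbf{x}, \mathbf{M}) = \frac{\mathbf{x}^\top \mathbf{M} \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}.$$

In other words, $\mathbf{u}_1 = \mathrm{argmax}_{\mathbf{x}}\, r(\mathbf{x}, \mathbf{M})$ and $r(\mathbf{u}_1, \mathbf{M}) = \sigma_1(\mathbf{M})$. Similarly, if $\mathbf{M}' = \mathbf{P}_{U_i, \perp} \mathbf{M}$, that is, the projection of $\mathbf{M}$ onto the subspace orthogonal to $\mathbf{U}_i$, then $\mathbf{u}_{i+1} = \mathrm{argmax}_{\mathbf{x}}\, r(\mathbf{x}, \mathbf{M}')$, where $\mathbf{u}_{i+1}$ is the $(i+1)$st singular vector of $\mathbf{M}$.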
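The Rayleigh quotient property can be illustrated numerically. The sketch below uses power iteration, which is a standard technique swapped in here for illustration and not a method described in the text, on a hypothetical $2 \times 2$ SPSD matrix with eigenvalues 3 and 1. The Rayleigh quotient at the resulting vector should approach the top singular value.

```python
import math


def matvec(M, x):
    """Matrix-vector product for M given as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def rayleigh(x, M):
    """Rayleigh quotient r(x, M) = x^T M x / x^T x."""
    Mx = matvec(M, x)
    return sum(xi * mi for xi, mi in zip(x, Mx)) / sum(xi * xi for xi in x)


def power_iteration(M, x0, steps=200):
    """Repeatedly apply M and renormalize to approximate the top singular vector."""
    x = x0
    for _ in range(steps):
        y = matvec(M, x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Hypothetical SPSD matrix with eigenvalues 3 and 1.
M = [[2.0, 1.0],
     [1.0, 2.0]]

u1 = power_iteration(M, [1.0, 0.0])
top = rayleigh(u1, M)               # approaches sigma_1(M) = 3
other = rayleigh([1.0, -1.0], M)    # quotient at the second eigenvector
```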


Appendix B Convex Optimization


In this appendix, we introduce the main definitions and results of convex optimization
needed for the analysis of the learning algorithms presented in this book.


B.1 Differentiation and unconstrained optimization 


We start with some basic definitions for differentiation needed to present Fermat’s 
theorem and to describe some properties of convex functions. 


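Definition B.1 Gradient
Let $f\colon \mathcal{X} \subseteq \mathbb{R}^N \to \mathbb{R}$ be a differentiable function. Then, the gradient of $f$ at $\mathbf{x} \in \mathcal{X}$ is the vector in $\mathbb{R}^N$ denoted by $\nabla f(\mathbf{x})$ and defined by

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_N}(\mathbf{x}) \end{bmatrix}.$$


Definition B.2 Hessian
Let $f\colon \mathcal{X} \subseteq \mathbb{R}^N \to \mathbb{R}$ be a twice differentiable function. Then, the Hessian of $f$ at $\mathbf{x} \in \mathcal{X}$ is the matrix in $\mathbb{R}^{N \times N}$ denoted by $\nabla^2 f(\mathbf{x})$ and defined by

$$\nabla^2 f(\mathbf{x}) = \Big[ \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}) \Big]_{1 \leq i, j \leq N}.$$

Next, we present a classic result for unconstrained optimization.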


Theorem B.1 Fermat’s theorem
Let $f\colon \mathcal{X} \subseteq \mathbb{R}^N \to \mathbb{R}$ be a differentiable function. If $f$ admits a local extremum at
$\mathbf{x}^* \in \mathcal{X}$, then $\nabla f(\mathbf{x}^*) = 0$, that is, $\mathbf{x}^*$ is a stationary point.

Figure B.1 Examples of a convex (left) and a concave (right) function. Note that
any line segment drawn between two points on the convex function lies entirely 
above the graph of the function while any line segment drawn between two points 
on the concave function lies entirely below the graph of the function. 


B.2  Convexity 


This section introduces the notions of convex sets and convex functions. Convex 
functions play an important role in the design and analysis of learning algorithms, 
in part because a local minimum of a convex function is necessarily also a global 
minimum. Thus, the properties of a learning hypothesis that is a local minimum 
of a convex optimization are often well understood, while for some non-convex 
optimization problems, there may be a very large number of local minima for which 
no clear characterization can be given. 


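Definition B.3 Convex set
A set $\mathcal{X} \subseteq \mathbb{R}^N$ is said to be convex if for any two points $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ the segment $[\mathbf{x}, \mathbf{y}]$
lies in $\mathcal{X}$, that is

$$\{\alpha \mathbf{x} + (1 - \alpha) \mathbf{y} \colon 0 \leq \alpha \leq 1\} \subseteq \mathcal{X}.$$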


Definition B.4 Convex hull
The convex hull $\mathrm{conv}(\mathcal{X})$ of a set of points $\mathcal{X} \subseteq \mathbb{R}^N$ is the minimal convex set
containing $\mathcal{X}$ and can be equivalently defined as follows:

$$\mathrm{conv}(\mathcal{X}) = \Big\{ \sum_{i=1}^{m} \alpha_i \mathbf{x}_i \colon m \geq 1, \forall i \in [1, m], \mathbf{x}_i \in \mathcal{X}, \alpha_i \geq 0, \sum_{i=1}^{m} \alpha_i = 1 \Big\}. \tag{B.1}$$

Let $\mathrm{Epi}\, f$ denote the epigraph of function $f\colon \mathcal{X} \to \mathbb{R}$, that is the set of points lying
above its graph: $\{(x, y) \colon x \in \mathcal{X}, y \geq f(x)\}$.

Figure B.2 Illustration of the first-order property satisfied by all convex functions.


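Definition B.5 Convex function
Let $\mathcal{X}$ be a convex set. A function $f\colon \mathcal{X} \to \mathbb{R}$ is said to be convex iff $\mathrm{Epi}\, f$ is a
convex set, or, equivalently, if for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ and $\alpha \in [0, 1]$,

$$f(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \leq \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}). \tag{B.2}$$

$f$ is said to be strictly convex if inequality (B.2) is strict for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ where
$\mathbf{x} \neq \mathbf{y}$ and $\alpha \in (0, 1)$. $f$ is said to be (strictly) concave when $-f$ is (strictly)
convex. Figure B.1 shows simple examples of a convex and a concave function.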
Convex functions can also be characterized in terms of their first- or second-order 
differential. 


Theorem B.2
Let $f$ be a differentiable function, then $f$ is convex if and only if $\mathrm{dom}(f)$ is convex
and the following inequalities hold:

$$\forall \mathbf{x}, \mathbf{y} \in \mathrm{dom}(f), \quad f(\mathbf{y}) - f(\mathbf{x}) \geq \nabla f(\mathbf{x}) \cdot (\mathbf{y} - \mathbf{x}). \tag{B.3}$$

The property (B.3) is illustrated by figure B.2: for a convex function, the hyperplane
tangent at $\mathbf{x}$ is always below the graph.
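The first-order property can be probed numerically in one dimension. The following sketch, with hypothetical helper names and a fixed random seed chosen for illustration, evaluates the gap $f(y) - f(x) - f'(x)(y - x)$ on random pairs. For the convex function $\exp$ the gap stays non-negative, while for $\sin$, which is not convex on the sampled interval, it becomes negative for some pairs.

```python
import math
import random


def first_order_gap(f, grad, x, y):
    """Gap f(y) - f(x) - f'(x) (y - x); non-negative for convex f."""
    return f(y) - f(x) - grad(x) * (y - x)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
pairs = [(rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(1000)]

# exp is convex with derivative exp: the smallest gap should be >= 0.
min_gap_exp = min(first_order_gap(math.exp, math.exp, x, y) for x, y in pairs)

# sin is not convex on [-3, 3]: some tangent lines lie above the graph.
min_gap_sin = min(first_order_gap(math.sin, math.cos, x, y) for x, y in pairs)
```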


Theorem B.3
Let $f$ be a twice differentiable function, then $f$ is convex iff $\mathrm{dom}(f)$ is convex and
its Hessian is positive semidefinite:

$$\forall \mathbf{x} \in \mathrm{dom}(f), \quad \nabla^2 f(\mathbf{x}) \succeq 0.$$

Recall that a symmetric matrix is positive semidefinite if all of its eigenvalues are
non-negative. Further, note that when $f$ is scalar, this theorem states that $f$ is
convex if and only if its second derivative is always non-negative, that is, for all
$x \in \mathrm{dom}(f)$, $f''(x) \geq 0$.


Example B.1 Linear functions
Any linear function $f$ is both convex and concave, since equation (B.2) holds with
equality for both $f$ and $-f$ by the definition of linearity.


Example B.2 Quadratic function
The function $f\colon x \mapsto x^2$ defined over $\mathbb{R}$ is convex since it is twice differentiable and
for all $x \in \mathbb{R}$, $f''(x) = 2 > 0$.


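Example B.3 Norms
Any norm $\|\cdot\|$ defined over a convex set $\mathcal{X}$ is convex since by the triangle inequality
and homogeneity property of the norm, for all $\alpha \in [0, 1]$, $\mathbf{x}, \mathbf{y} \in \mathcal{X}$, we can write

$$\|\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}\| \leq \|\alpha \mathbf{x}\| + \|(1 - \alpha) \mathbf{y}\| = \alpha \|\mathbf{x}\| + (1 - \alpha) \|\mathbf{y}\|.$$


Example B.4 Maximum function
The max function defined for all $\mathbf{x} \in \mathbb{R}^N$ by $\mathbf{x} \mapsto \max_{j \in [1, N]} x_j$ is convex. For all
$\alpha \in [0, 1]$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^N$, by the subadditivity of max, we can write

$$\max_j (\alpha x_j + (1 - \alpha) y_j) \leq \max_j (\alpha x_j) + \max_j ((1 - \alpha) y_j) = \alpha \max_j (x_j) + (1 - \alpha) \max_j (y_j).$$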


One useful approach for proving convexity or concavity of functions is to make 
use of composition rules. For simplicity of presentation, we will assume twice 
differentiability, although the results can also be proven without this assumption. 


Lemma B.1 Composition of convex/concave functions
Assume $h\colon \mathbb{R} \to \mathbb{R}$ and $g\colon \mathbb{R}^N \to \mathbb{R}$ are twice differentiable functions and for all
$\mathbf{x} \in \mathbb{R}^N$, define $f(\mathbf{x}) = h(g(\mathbf{x}))$. Then the following implications are valid:

- $h$ is convex and non-decreasing, and $g$ is convex $\Rightarrow$ $f$ is convex.

- $h$ is convex and non-increasing, and $g$ is concave $\Rightarrow$ $f$ is convex.

- $h$ is concave and non-decreasing, and $g$ is concave $\Rightarrow$ $f$ is concave.

- $h$ is concave and non-increasing, and $g$ is convex $\Rightarrow$ $f$ is concave.

Proof We restrict ourselves to $N = 1$, since it suffices to prove convexity (concavity) along all arbitrary lines that intersect the domain. Now, consider the second
derivative of $f$:

$$f''(x) = h''(g(x))\, g'(x)^2 + h'(g(x))\, g''(x). \tag{B.4}$$

Note that if $h$ is convex and non-decreasing, we have $h'' \geq 0$ and $h' \geq 0$.
Furthermore, if $g$ is convex we also have $g'' \geq 0$, and it follows that $f''(x) \geq 0$,
which proves the first statement. The remainder of the statements are proven in a
similar manner.


Example B.5 Composition of functions
The previous lemma can be used to immediately prove the convexity or concavity
of the following composed functions:

- If $f\colon \mathbb{R}^N \to \mathbb{R}$ is convex, then $\exp(f)$ is convex.

- Any squared norm $\|\cdot\|^2$ is convex.

- For all $\mathbf{x} \in \mathbb{R}^N$ the function $\mathbf{x} \mapsto \log(\sum_{j=1}^{N} x_j)$ is concave.


The following is a useful inequality applied in a variety of contexts. It is in fact a 
quasi-direct consequence of the definition of convexity. 


Theorem B.4 Jensen’s inequality
Let $X$ be a random variable taking values in a non-empty convex set $C \subseteq \mathbb{R}^N$ with a
finite expectation $\mathrm{E}[X]$, and $f$ a measurable convex function defined over $C$. Then,
$\mathrm{E}[X]$ is in $C$, $\mathrm{E}[f(X)]$ is finite, and the following inequality holds:

$$f(\mathrm{E}[X]) \leq \mathrm{E}[f(X)].$$

Proof We give a sketch of the proof, which essentially follows from the definition
of convexity. Note that for any finite set of elements $x_1, \ldots, x_n$ in $C$ and any positive
reals $\alpha_1, \ldots, \alpha_n$ such that $\sum_{i=1}^{n} \alpha_i = 1$, we have

$$f\Big( \sum_{i=1}^{n} \alpha_i x_i \Big) \leq \sum_{i=1}^{n} \alpha_i f(x_i).$$

This follows straightforwardly by induction from the definition of convexity. Since
the $\alpha_i$s can be interpreted as probabilities, this immediately proves the inequality
for any distribution with a finite support defined by $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$:

$$f(\mathrm{E}[X]) \leq \mathrm{E}[f(X)].$$

Extending this to arbitrary distributions can be shown via the continuity of $f$ on
any open set, which is guaranteed by the convexity of $f$, and the weak density of
distributions with finite support in the family of all probability measures.
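For a distribution with finite support, Jensen’s inequality can be checked directly. The sketch below uses a hypothetical three-point distribution and the convex function $\exp$; the values and probabilities are assumptions chosen only for illustration.

```python
import math

# Hypothetical finite-support distribution: values and their probabilities.
values = [-1.0, 0.0, 2.0]
probs = [0.2, 0.5, 0.3]


def expectation(g):
    """E[g(X)] for the finite-support distribution above."""
    return sum(p * g(x) for x, p in zip(values, probs))


mean = expectation(lambda x: x)   # E[X] = 0.4
lhs = math.exp(mean)              # f(E[X]) with f = exp
rhs = expectation(math.exp)       # E[f(X)]
jensen_holds = lhs <= rhs
```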


B.3 Constrained optimization 


We now define a general constrained optimization problem and the specific properties
associated to convex constrained optimization problems.

Definition B.6 Constrained optimization problem
Let $\mathcal{X} \subseteq \mathbb{R}^N$ and $f, g_i\colon \mathcal{X} \to \mathbb{R}$, for all $i \in [1, m]$. Then, a constrained optimization
problem has the form:

$$\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) \quad \text{subject to: } g_i(\mathbf{x}) \leq 0, \ \forall i \in \{1, \ldots, m\}.$$

This general formulation does not make any convexity assumptions and can be
augmented with equality constraints. It is referred to as the primal problem in
contrast with a related problem introduced later. We will denote by $p^*$ the optimal
value of the objective.

For any $\mathbf{x} \in \mathcal{X}$, we will denote by $g(\mathbf{x})$ the vector $(g_1(\mathbf{x}), \ldots, g_m(\mathbf{x}))^\top$. Thus, the
constraints can be written as $g(\mathbf{x}) \leq 0$. To any constrained optimization problem,
we can associate a Lagrange function that plays an important role in the analysis of the
problem and its relationship with another related optimization problem.


Definition B.7 Lagrangian
The Lagrange function or the Lagrangian associated to the general constrained
optimization problem of definition B.6 is the function defined over $\mathcal{X} \times \mathbb{R}_+^m$ by:

$$\forall \mathbf{x} \in \mathcal{X}, \forall \boldsymbol{\alpha} \geq 0, \quad \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}),$$

where the variables $\alpha_i$ are known as the Lagrange or dual variables with $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^\top$.

Any equality constraint of the form $g(\mathbf{x}) = 0$ for a function $g$ can be equivalently
expressed by two inequalities: $-g(\mathbf{x}) \leq 0$ and $+g(\mathbf{x}) \leq 0$. Let $\alpha_- \geq 0$ be the
Lagrange variable associated to the first constraint and $\alpha_+ \geq 0$ the one associated
to the second constraint. The sum of the terms corresponding to these constraints
in the definition of the Lagrange function can therefore be written as $\alpha g(\mathbf{x})$ with
$\alpha = (\alpha_+ - \alpha_-)$. Thus, in general, for an equality constraint $g(\mathbf{x}) = 0$ the Lagrangian
is augmented with a term $\alpha g(\mathbf{x})$ but with $\alpha \in \mathbb{R}$ not constrained to be non-negative.
Note that in the case of a convex optimization problem, equality constraints $g(\mathbf{x})$
are required to be affine since both $g(\mathbf{x})$ and $-g(\mathbf{x})$ are required to be convex.


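Definition B.8 Dual function
The (Lagrange) dual function associated to the constrained optimization problem is
defined by

$$\forall \boldsymbol{\alpha} \geq 0, \quad F(\boldsymbol{\alpha}) = \inf_{\mathbf{x} \in \mathcal{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}) = \inf_{\mathbf{x} \in \mathcal{X}} \Big( f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) \Big). \tag{B.5}$$

Note that $F$ is always concave, since the Lagrangian is linear with respect to $\boldsymbol{\alpha}$ and
since the infimum preserves concavity. We further observe that

$$\forall \boldsymbol{\alpha} \geq 0, \quad F(\boldsymbol{\alpha}) \leq p^*, \tag{B.6}$$

since for any feasible $\mathbf{x}$, $f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) \leq f(\mathbf{x})$. The dual function naturally
leads to the following optimization problem.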


Definition B.9 Dual problem
The dual (optimization) problem associated to the constrained optimization problem
is

$$\max_{\boldsymbol{\alpha}} F(\boldsymbol{\alpha}) \quad \text{subject to: } \boldsymbol{\alpha} \geq 0.$$

The dual problem is always a convex optimization problem (as a maximization of a
concave function). Let $d^*$ denote its optimal value. By (B.6), the following inequality
always holds:

$$d^* \leq p^* \quad \text{(weak duality)}.$$

The difference $(p^* - d^*)$ is known as the duality gap. The equality case

$$d^* = p^* \quad \text{(strong duality)}$$

does not hold in general. However, strong duality does hold when convex problems
satisfy a constraint qualification. We will denote by $\mathrm{int}(\mathcal{X})$ the interior of the set
$\mathcal{X}$.
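Weak and strong duality can be illustrated on a one-dimensional toy problem, which is an assumption introduced here and not an example from the text: minimize $x^2$ subject to $1 - x \leq 0$. The primal optimum is $p^* = 1$ at $x = 1$. The Lagrangian $x^2 + \alpha (1 - x)$ is minimized over $x$ at $x = \alpha / 2$, giving the dual function $F(\alpha) = \alpha - \alpha^2 / 4$. The sketch below maximizes $F$ over a grid of $\alpha \geq 0$.

```python
def lagrangian(x, alpha):
    """Lagrangian of: minimize x^2 subject to 1 - x <= 0 (toy problem)."""
    return x * x + alpha * (1.0 - x)


def dual(alpha):
    """Dual function F(alpha) = inf_x L(x, alpha), attained at x = alpha / 2."""
    x_min = alpha / 2.0  # stationary point of the convex quadratic in x
    return lagrangian(x_min, alpha)


p_star = 1.0  # primal optimum: the constraint x >= 1 is active at x = 1

# Grid search over alpha in [0, 5] with step 0.01.
alphas = [i / 100.0 for i in range(0, 501)]
d_star, alpha_star = max((dual(a), a) for a in alphas)

weak_duality = all(dual(a) <= p_star + 1e-12 for a in alphas)
duality_gap = p_star - d_star
```

In this toy problem the point $x = 2$ is strictly feasible and both the objective and the constraint are convex, so the zero duality gap found by the grid search is what a constraint qualification would lead one to expect.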


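Definition B.10 Strong constraint qualification
Assume that $\mathrm{int}(\mathcal{X}) \neq \emptyset$. Then, the strong constraint qualification or Slater’s
condition is defined as

$$\exists\, \bar{\mathbf{x}} \in \mathrm{int}(\mathcal{X}) \colon g(\bar{\mathbf{x}}) < 0. \tag{B.7}$$

A function $h\colon \mathcal{X} \to \mathbb{R}$ is said to be affine if it can be defined for all $\mathbf{x} \in \mathcal{X}$ by
$h(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$, for some $\mathbf{w} \in \mathbb{R}^N$ and $b \in \mathbb{R}$.


Definition B.11 Weak constraint qualification
Assume that $\mathrm{int}(\mathcal{X}) \neq \emptyset$. Then, the weak constraint qualification or weak Slater’s
condition is defined as

$$\exists\, \bar{\mathbf{x}} \in \mathrm{int}(\mathcal{X}) \colon \forall i \in [1, m], \ (g_i(\bar{\mathbf{x}}) < 0) \vee (g_i(\bar{\mathbf{x}}) = 0 \wedge g_i \text{ affine}). \tag{B.8}$$

We next present sufficient and necessary conditions for solutions to constrained
optimization problems, based on the saddle point of the Lagrangian and Slater’s
condition.


Theorem B.5 Saddle point - sufficient condition
Let $P$ be a constrained optimization problem over $\mathcal{X} = \mathbb{R}^N$. If $(\mathbf{x}^*, \boldsymbol{\alpha}^*)$ is a saddle
point of the associated Lagrangian, that is,

$$\forall \mathbf{x} \in \mathbb{R}^N, \forall \boldsymbol{\alpha} \geq 0, \quad \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}) \leq \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}^*) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}^*), \tag{B.9}$$

then $(\mathbf{x}^*, \boldsymbol{\alpha}^*)$ is a solution of the problem $P$.

Proof By the first inequality, the following holds:

$$\begin{aligned}
\forall \boldsymbol{\alpha} \geq 0, \ \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}) \leq \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}^*) &\Rightarrow \forall \boldsymbol{\alpha} \geq 0, \ \boldsymbol{\alpha} \cdot g(\mathbf{x}^*) \leq \boldsymbol{\alpha}^* \cdot g(\mathbf{x}^*) \\
&\Rightarrow g(\mathbf{x}^*) \leq 0 \wedge \boldsymbol{\alpha}^* \cdot g(\mathbf{x}^*) = 0,
\end{aligned} \tag{B.10}$$

where $g(\mathbf{x}^*) \leq 0$ in (B.10) follows by letting $\boldsymbol{\alpha} \to +\infty$ and $\boldsymbol{\alpha}^* \cdot g(\mathbf{x}^*) = 0$ follows
by letting $\boldsymbol{\alpha} \to 0$. In view of (B.10), the second inequality in (B.9) gives,

$$\forall \mathbf{x}, \ \mathcal{L}(\mathbf{x}^*, \boldsymbol{\alpha}^*) \leq \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha}^*) \Rightarrow \forall \mathbf{x}, \ f(\mathbf{x}^*) \leq f(\mathbf{x}) + \boldsymbol{\alpha}^* \cdot g(\mathbf{x}).$$

Thus, for all $\mathbf{x}$ satisfying the constraints, that is $g(\mathbf{x}) \leq 0$, we have

$$f(\mathbf{x}^*) \leq f(\mathbf{x}),$$

which completes the proof. ∎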


Theorem B.6 Saddle point - necessary condition
Assume that $f$ and $g_i$, $i \in [1, m]$, are convex functions and that Slater’s condition
holds. Then, if $\mathbf{x}$ is a solution of the constrained optimization problem, then there
exists $\boldsymbol{\alpha} \geq 0$ such that $(\mathbf{x}, \boldsymbol{\alpha})$ is a saddle point of the Lagrangian.


Theorem B.7 Saddle point - necessary condition
Assume that $f$ and $g_i$, $i \in [1, m]$, are convex differentiable functions and that the
weak Slater’s condition holds. If $\mathbf{x}$ is a solution of the constrained optimization
problem, then there exists $\boldsymbol{\alpha} \geq 0$ such that $(\mathbf{x}, \boldsymbol{\alpha})$ is a saddle point of the Lagrangian.


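We conclude with a theorem providing necessary and sufficient optimality conditions
when the problem is convex, the objective function differentiable, and the
constraints qualified.


Theorem B.8 Karush-Kuhn-Tucker’s theorem
Assume that $f, g_i\colon \mathcal{X} \to \mathbb{R}$, $\forall i \in \{1, \ldots, m\}$ are convex and differentiable and that
the constraints are qualified. Then $\bar{\mathbf{x}}$ is a solution of the constrained program if and
only if there exists $\bar{\boldsymbol{\alpha}} \geq 0$ such that,

$$\nabla_{\mathbf{x}} \mathcal{L}(\bar{\mathbf{x}}, \bar{\boldsymbol{\alpha}}) = \nabla_{\mathbf{x}} f(\bar{\mathbf{x}}) + \bar{\boldsymbol{\alpha}} \cdot \nabla_{\mathbf{x}} g(\bar{\mathbf{x}}) = 0 \tag{B.11}$$
$$\nabla_{\boldsymbol{\alpha}} \mathcal{L}(\bar{\mathbf{x}}, \bar{\boldsymbol{\alpha}}) = g(\bar{\mathbf{x}}) \leq 0 \tag{B.12}$$
$$\bar{\boldsymbol{\alpha}} \cdot g(\bar{\mathbf{x}}) = \sum_{i=1}^{m} \bar{\alpha}_i g_i(\bar{\mathbf{x}}) = 0. \tag{B.13}$$

The conditions B.11-B.13 are known as the KKT conditions. Note that the last two
KKT conditions are equivalent to

$$g(\bar{\mathbf{x}}) \leq 0 \wedge (\forall i \in \{1, \ldots, m\}, \ \bar{\alpha}_i g_i(\bar{\mathbf{x}}) = 0). \tag{B.14}$$

These equalities are known as complementarity conditions.


Proof For the forward direction, since the constraints are qualified, if $\bar{\mathbf{x}}$ is a
solution, then there exists $\bar{\boldsymbol{\alpha}}$ such that $(\bar{\mathbf{x}}, \bar{\boldsymbol{\alpha}})$ is a saddle point of the Lagrangian
and all three conditions are satisfied (the first condition follows by definition of a
saddle point, and the second two conditions follow from (B.10)).

In the opposite direction, if the conditions are met, then for any $\mathbf{x}$ such that
$g(\mathbf{x}) \leq 0$, we can write

$$\begin{aligned}
f(\mathbf{x}) - f(\bar{\mathbf{x}}) &\geq \nabla_{\mathbf{x}} f(\bar{\mathbf{x}}) \cdot (\mathbf{x} - \bar{\mathbf{x}}) && \text{(convexity of } f\text{)} \\
&\geq -\sum_{i=1}^{m} \bar{\alpha}_i \nabla_{\mathbf{x}} g_i(\bar{\mathbf{x}}) \cdot (\mathbf{x} - \bar{\mathbf{x}}) && \text{(first condition)} \\
&\geq -\sum_{i=1}^{m} \bar{\alpha}_i [g_i(\mathbf{x}) - g_i(\bar{\mathbf{x}})] && \text{(convexity of } g_i\text{s)} \\
&\geq -\sum_{i=1}^{m} \bar{\alpha}_i g_i(\mathbf{x}) \geq 0, && \text{(third and second condition)}
\end{aligned}$$

which shows that $f(\bar{\mathbf{x}})$ is the minimum of $f$ over the set of points satisfying the
constraints.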


B.4 Chapter notes 


The results presented in this appendix are based on three main theorems: theorem B.1
due to Fermat [1629]; theorem B.5 due to Lagrange [1797], and theorem B.8
due to Karush [1939] and Kuhn and Tucker [1951].

For more extensive material on convex optimization, we strongly recommend
the book of Boyd and Vandenberghe [2004].


Appendix C Probability Review 


In this appendix, we give a brief review of some basic notions of probability and 
will also define the notation that is used throughout the textbook. 


C.1 Probability 


A probability space is a model based on three components: a sample space, an events 
set, and a probability distribution: 


- sample space $\Omega$: $\Omega$ is the set of all elementary events or outcomes possible in a
trial, for example, each of the six outcomes in $\{1, \ldots, 6\}$ when casting a die.

- events set $\mathcal{F}$: $\mathcal{F}$ is a $\sigma$-algebra, that is a set of subsets of $\Omega$ containing $\Omega$ that
is closed under complementation and countable union (therefore also countable
intersection). An example of an event may be “the die lands on an odd number”.

- probability distribution: $\Pr$ is a mapping from the set of all events $\mathcal{F}$ to $[0, 1]$ such
that $\Pr[\Omega] = 1$ and, for all mutually exclusive events $A_1, \ldots, A_n$,

$$\Pr[A_1 \cup \ldots \cup A_n] = \sum_{i=1}^{n} \Pr[A_i].$$

The discrete probability distribution associated with a fair die can be defined by
$\Pr[A_i] = 1/6$ for $i \in \{1, \ldots, 6\}$, where $A_i$ is the event that the die lands on value $i$.


C.2 Random variables 


Definition C.1 Random variables
A random variable $X$ is a function $X\colon \Omega \to \mathbb{R}$ that is measurable, that is such that
for any interval $I$, the subset of the sample space $\{\omega \in \Omega \colon X(\omega) \in I\}$ is an event.


The probability mass function of a discrete random variable $X$ is defined as the
function $x \mapsto \Pr[X = x]$. The joint probability mass function of discrete random
variables $X$ and $Y$ is defined as the function $(x, y) \mapsto \Pr[X = x \wedge Y = y]$.

A probability distribution is said to be absolutely continuous when it admits a
probability density function, that is a function $f$ associated to a real-valued random
variable $X$ that satisfies for all $a, b \in \mathbb{R}$

$$\Pr[a \leq X \leq b] = \int_a^b f(x)\, dx. \tag{C.1}$$

Figure C.1 Approximation of the binomial distribution (in red) by a normal
distribution (in blue).


Definition C.2 Binomial distribution
A random variable $X$ is said to follow a binomial distribution $B(n, p)$ with $n \in \mathbb{N}$
and $p \in [0, 1]$ if for any $k \in \{0, 1, \ldots, n\}$,

$$\Pr[X = k] = \binom{n}{k} p^k (1 - p)^{n - k}.$$

Definition C.3 Normal distribution
A random variable $X$ is said to follow a normal (or Gaussian) distribution $N(\mu, \sigma^2)$
with $\mu \in \mathbb{R}$ and $\sigma > 0$ if its probability density function is given by,

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\Big( -\frac{(x - \mu)^2}{2 \sigma^2} \Big).$$

The standard normal distribution $N(0, 1)$ is the normal distribution with zero mean
and unit variance.

The normal distribution is often used to approximate a binomial distribution.
Figure C.1 illustrates that approximation.


Definition C.4 Laplace distribution
A random variable $X$ is said to follow a Laplace distribution with location parameter
$\mu \in \mathbb{R}$ and scale parameter $b > 0$ if its probability density function is given by,

$$f(x) = \frac{1}{2b} \exp\Big( -\frac{|x - \mu|}{b} \Big).$$


Definition C.5 Poisson distribution
A random variable $X$ is said to follow a Poisson distribution with $\lambda > 0$ if for any
$k \in \mathbb{N}$,

$$\Pr[X = k] = \frac{\lambda^k e^{-\lambda}}{k!}.$$

The definition of the following family of distributions uses the notion of independence
of random variables defined in the next section.


Definition C.6 $\chi^2$-distribution
The $\chi^2$-distribution (or chi-squared distribution) with $k$ degrees of freedom is the
distribution of the sum of the squares of $k$ independent random variables, each
following a standard normal distribution.


C.3 Conditional probability and independence 


Definition C.7 Conditional probability
The conditional probability of event $A$ given event $B$ is defined by

$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}, \tag{C.2}$$

when $\Pr[B] \neq 0$.

Definition C.8 Independence
Two events $A$ and $B$ are said to be independent if

$$\Pr[A \cap B] = \Pr[A] \Pr[B]. \tag{C.3}$$

Equivalently, $A$ and $B$ are independent iff $\Pr[A \mid B] = \Pr[A]$ when $\Pr[B] \neq 0$.


A sequence of random variables is said to be independently and identically distributed 
(i.i.d.) when the random variables are mutually independent and follow the same 
distribution. 

The following are basic probability formulae related to the notion of conditional
probability. They hold for any events $A$, $B$, and $A_1, \ldots, A_n$, with the additional
constraint $\Pr[B] \neq 0$ needed for the Bayes formula to be well defined:

$$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B] \quad \text{(sum rule)} \tag{C.4}$$
$$\Pr\Big[ \bigcup_{i=1}^{n} A_i \Big] \leq \sum_{i=1}^{n} \Pr[A_i] \quad \text{(union bound)} \tag{C.5}$$
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \Pr[A]}{\Pr[B]} \quad \text{(Bayes formula)} \tag{C.6}$$
$$\Pr\Big[ \bigcap_{i=1}^{n} A_i \Big] = \Pr[A_1] \Pr[A_2 \mid A_1] \cdots \Pr\Big[ A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i \Big] \quad \text{(chain rule)}. \tag{C.7}$$

The sum rule follows immediately from the decomposition of $A \cup B$ as the union of
the disjoint sets $A$ and $(B - A \cap B)$. The union bound is a direct consequence of the
sum rule. The Bayes formula follows immediately from the definition of conditional
probability and the observation that: $\Pr[A \mid B] \Pr[B] = \Pr[B \mid A] \Pr[A] = \Pr[A \cap B]$.
Similarly, the chain rule follows from the observation that $\Pr[A_1] \Pr[A_2 \mid A_1] = \Pr[A_1 \cap A_2]$;
using the same argument shows recursively that the product of the first $k$ terms of
the right-hand side equals $\Pr[\bigcap_{i=1}^{k} A_i]$.

Finally, assume that $\Omega = A_1 \cup A_2 \cup \ldots \cup A_n$ with $A_i \cap A_j = \emptyset$ for $i \neq j$, i.e., the
$A_i$s are mutually disjoint. Then, the following formula is valid for any event $B$:

$$\Pr[B] = \sum_{i=1}^{n} \Pr[B \mid A_i] \Pr[A_i] \quad \text{(theorem of total probability)}. \tag{C.8}$$

This follows from the observation that $\Pr[B \mid A_i] \Pr[A_i] = \Pr[B \cap A_i]$ by definition of the
conditional probability and the fact that the events $B \cap A_i$ are mutually disjoint.


Example C.1 Application of the Bayes formula 

Let $H$ be a set of hypotheses. The maximum a posteriori (MAP) principle consists
of selecting the hypothesis $\hat{h} \in H$ that is the most probable given the observation
$O$. Thus, by the Bayes formula, it is given by

$$\hat{h} = \mathrm{argmax}_{h \in H} \Pr[h \mid O] = \mathrm{argmax}_{h \in H} \frac{\Pr[O \mid h] \Pr[h]}{\Pr[O]} = \mathrm{argmax}_{h \in H} \Pr[O \mid h] \Pr[h]. \tag{C.9}$$

Now, suppose we need to determine if a patient has a rare disease, given a laboratory
test of that patient. The hypothesis set is reduced to the two outcomes: d (disease)
and nd (no disease), thus $H = \{d, nd\}$. The laboratory test is either pos (positive)
or neg (negative), thus $O = \{pos, neg\}$.

Suppose that the disease is rare, say $\Pr[d] = .005$ and that the laboratory is
relatively accurate: $\Pr[pos \mid d] = .98$, and $\Pr[neg \mid nd] = .95$. Then, if the test is
positive, what should be the diagnosis? We can compute the right-hand side of
(C.9) for both hypotheses to determine $\hat{h}$:

$$\Pr[pos \mid d] \Pr[d] = .98 \times .005 = .0049$$
$$\Pr[pos \mid nd] \Pr[nd] = (1 - .95) \times (1 - .005) = .04975 > .0049.$$

Thus, in this case, the MAP prediction is $\hat{h} = nd$: with the values indicated, a
patient with a positive test result is nonetheless more likely not to have the disease!
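The computation in this example can be reproduced in a few lines. The sketch below uses the probabilities given in the example; the dictionary layout and variable names are illustrative assumptions. It also normalizes the two scores to obtain the posterior probability of the disease given a positive test, which is not computed in the example itself.

```python
# Prior and likelihood of a positive test, using the values from the example.
prior = {"d": 0.005, "nd": 1.0 - 0.005}
likelihood_pos = {"d": 0.98, "nd": 1.0 - 0.95}

# Unnormalized posterior scores Pr[pos | h] Pr[h], as in the right-hand side of (C.9).
scores = {h: likelihood_pos[h] * prior[h] for h in prior}

# MAP hypothesis: the one with the largest score.
h_map = max(scores, key=scores.get)

# Normalizing by Pr[pos] (total probability) gives the posterior Pr[d | pos].
posterior_d = scores["d"] / sum(scores.values())
```

With these values the posterior probability of disease after a positive test is roughly 9 percent, which is why the MAP prediction remains nd.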


C.4 Expectation, Markov’s inequality, and moment-generating 
function 


Definition C.9 Expectation
The expectation or mean of a random variable $X$ is denoted by $\mathrm{E}[X]$ and defined
by

$$\mathrm{E}[X] = \sum_x x \Pr[X = x]. \tag{C.10}$$

When $X$ follows a probability distribution $D$, we will also write $\mathrm{E}_{x \sim D}[x]$ instead of
$\mathrm{E}[X]$ to explicitly indicate the distribution. A fundamental property of expectation,
which is straightforward to verify using its definition, is that it is linear, that is, for
any two random variables $X$ and $Y$ and any $a, b \in \mathbb{R}$, the following holds:

$$\mathrm{E}[aX + bY] = a \mathrm{E}[X] + b \mathrm{E}[Y]. \tag{C.11}$$

Furthermore, when $X$ and $Y$ are independent random variables, then the following
identity holds:

$$\mathrm{E}[XY] = \mathrm{E}[X] \mathrm{E}[Y]. \tag{C.12}$$

Indeed, by definition of expectation and of independence, we can write

$$\begin{aligned}
\mathrm{E}[XY] = \sum_{x, y} xy \Pr[X = x \wedge Y = y] &= \sum_{x, y} xy \Pr[X = x] \Pr[Y = y] \\
&= \Big( \sum_x x \Pr[X = x] \Big) \Big( \sum_y y \Pr[Y = y] \Big),
\end{aligned}$$

where in the last step we used Fubini’s theorem. The following provides a simple
bound for a non-negative random variable in terms of its expectation, known as
Markov’s inequality.

Theorem C.1 Markov’s inequality
Let $X$ be a non-negative random variable with $\mathrm{E}[X] < \infty$. Then for all $t > 0$,

$$\Pr\big[ X \geq t \mathrm{E}[X] \big] \leq \frac{1}{t}. \tag{C.13}$$

Proof The proof steps are as follows:

$$\begin{aligned}
\Pr[X \geq t \mathrm{E}[X]] &= \sum_{x \geq t \mathrm{E}[X]} \Pr[X = x] && \text{(by definition)} \\
&\leq \sum_{x \geq t \mathrm{E}[X]} \Pr[X = x] \frac{x}{t \mathrm{E}[X]} && \Big(\text{using } \frac{x}{t \mathrm{E}[X]} \geq 1\Big) \\
&\leq \sum_{x} \Pr[X = x] \frac{x}{t \mathrm{E}[X]} && \text{(extending non-negative sum)} \\
&= \mathrm{E}\Big[ \frac{X}{t \mathrm{E}[X]} \Big] = \frac{1}{t} && \text{(linearity of expectation)}.
\end{aligned}$$

This concludes the proof.
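Markov’s inequality can be verified exactly on a small discrete distribution. The sketch below uses a fair die and exact rational arithmetic; the choice of distribution and of the values of $t$ is an assumption made for illustration.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())  # E[X] = 7/2


def tail(threshold):
    """Exact value of Pr[X >= threshold]."""
    return sum(p for x, p in pmf.items() if x >= threshold)


# For each t, compare Pr[X >= t E[X]] with the Markov bound 1/t.
ts = [Fraction(1, 2), Fraction(1), Fraction(3, 2), Fraction(2)]
checks = [(t, tail(t * mean), 1 / t) for t in ts]
markov_holds = all(prob <= bound for _, prob, bound in checks)
```

For $t = 3/2$ the threshold is $5.25$, so the tail probability is $\Pr[X = 6] = 1/6$, well below the bound $2/3$; the bound is valid but can be loose.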


The following function based on the notion of expectation is often useful in the 
analysis of the properties of a distribution. 


Definition C.10 Moment-generating function
The moment-generating function of a random variable $X$ is the function $t \mapsto \mathrm{E}[e^{tX}]$
defined over the set of $t \in \mathbb{R}$ for which the expectation is finite.


We will present in the next appendix a general bound on the moment-generating
function of a zero-mean bounded random variable (Lemma D.1). Here, we illustrate
its computation in the case of a $\chi^2$-distribution.


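Example C.2 Moment-generating function of $\chi^2$-distribution
Let $X$ be a random variable following a $\chi^2$-distribution with $k$ degrees of
freedom. We can write $X = \sum_{i=1}^{k} X_i^2$ where the $X_i$s are independent and follow a
standard normal distribution.

Let $t < 1/2$. By the i.i.d. assumption about the variables $X_i$, we can write

$$\mathrm{E}[e^{tX}] = \mathrm{E}\Big[ \prod_{i=1}^{k} e^{t X_i^2} \Big] = \prod_{i=1}^{k} \mathrm{E}\big[ e^{t X_i^2} \big] = \mathrm{E}\big[ e^{t X_1^2} \big]^k.$$

By definition of the standard normal distribution, we have

$$\begin{aligned}
\mathrm{E}\big[ e^{t X_1^2} \big] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{t x^2} e^{-\frac{x^2}{2}}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-(1 - 2t) \frac{x^2}{2}}\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \frac{e^{-\frac{u^2}{2}}}{\sqrt{1 - 2t}}\, du = (1 - 2t)^{-\frac{1}{2}},
\end{aligned}$$

where we used the change of variable $u = \sqrt{1 - 2t}\, x$. In view of that, the moment-generating
function of the $\chi^2$-distribution is given by

$$\forall t < 1/2, \quad \mathrm{E}[e^{tX}] = (1 - 2t)^{-\frac{k}{2}}. \tag{C.14}$$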
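The closed form for the moment-generating function can be cross-checked by numerical integration. The sketch below approximates $\mathrm{E}[e^{t X_1^2}]$ for a standard normal $X_1$ with the trapezoid rule and compares its $k$th power with $(1 - 2t)^{-k/2}$. The integration range, step count, and function names are assumptions chosen for illustration.

```python
import math


def mgf_squared_normal(t, half_width=20.0, steps=20000):
    """Trapezoid-rule estimate of E[exp(t X^2)] for X ~ N(0, 1), valid for t < 1/2."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # endpoints get half weight
        # Integrand exp(t x^2) * exp(-x^2 / 2), combined into one exponent.
        total += weight * math.exp((t - 0.5) * x * x)
    return total * h / math.sqrt(2.0 * math.pi)


def chi2_mgf(t, k):
    """Closed form (1 - 2t)^(-k/2) for the chi-squared distribution."""
    return (1.0 - 2.0 * t) ** (-k / 2.0)


t = 0.25
single = mgf_squared_normal(t)   # numerical estimate for k = 1
numeric_k3 = single ** 3         # independence: product of k = 3 identical factors
closed_k3 = chi2_mgf(t, 3)
```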


C.5 Variance and Chebyshev’s inequality 


Definition C.11 Variance - Standard deviation
The variance of a random variable $X$ is denoted by $\mathrm{Var}[X]$ and defined by

$$\mathrm{Var}[X] = \mathrm{E}[(X - \mathrm{E}[X])^2]. \tag{C.15}$$

The standard deviation of a random variable $X$ is denoted by $\sigma_X$ and defined by

$$\sigma_X = \sqrt{\mathrm{Var}[X]}. \tag{C.16}$$

For any random variable $X$ and any $a \in \mathbb{R}$, the following basic properties hold for
the variance, which can be proven straightforwardly:

$$\mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2 \tag{C.17}$$
$$\mathrm{Var}[aX] = a^2 \mathrm{Var}[X]. \tag{C.18}$$

Furthermore, when $X$ and $Y$ are independent, then

$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]. \tag{C.19}$$

Indeed, using the linearity of expectation and the identity $\mathrm{E}[XY] - \mathrm{E}[X] \mathrm{E}[Y] = 0$
which holds by the independence of $X$ and $Y$, we can write

$$\begin{aligned}
\mathrm{Var}[X + Y] &= \mathrm{E}[(X + Y)^2] - \mathrm{E}[X + Y]^2 \\
&= \mathrm{E}[X^2 + Y^2 + 2XY] - (\mathrm{E}[X]^2 + \mathrm{E}[Y]^2 + 2 \mathrm{E}[X] \mathrm{E}[Y]) \\
&= (\mathrm{E}[X^2] - \mathrm{E}[X]^2) + (\mathrm{E}[Y^2] - \mathrm{E}[Y]^2) + 2 (\mathrm{E}[XY] - \mathrm{E}[X] \mathrm{E}[Y]) \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y].
\end{aligned}$$

The following inequality known as Chebyshev’s inequality bounds the deviation
of a random variable from its expectation in terms of its standard deviation.

Theorem C.2 Chebyshev’s inequality
Let $X$ be a random variable with $\mathrm{Var}[X] < +\infty$. Then, for all $t > 0$, the following
inequality holds:

$$\Pr\big[ |X - \mathrm{E}[X]| \geq t \sigma_X \big] \leq \frac{1}{t^2}. \tag{C.20}$$

Proof Observe that:

$$\Pr\big[ |X - \mathrm{E}[X]| \geq t \sigma_X \big] = \Pr\big[ (X - \mathrm{E}[X])^2 \geq t^2 \sigma_X^2 \big].$$

The result follows by application of Markov’s inequality to $(X - \mathrm{E}[X])^2$. ∎

We will use Chebyshev’s inequality to prove the following theorem.


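Theorem C.3 Weak law of large numbers
Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$
and variance $\sigma^2 < \infty$. Let $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, then, for any $\epsilon > 0$,

$$\lim_{n \to \infty} \Pr\big[ |\overline{X}_n - \mu| \geq \epsilon \big] = 0. \tag{C.21}$$

Proof Since the variables are independent, we can write

$$\mathrm{Var}[\overline{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[ \frac{X_i}{n} \Big] = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}.$$

Thus, by Chebyshev’s inequality (with $t = \epsilon / (\mathrm{Var}[\overline{X}_n])^{1/2}$), the following holds:

$$\Pr\big[ |\overline{X}_n - \mu| \geq \epsilon \big] \leq \frac{\sigma^2}{n \epsilon^2},$$

which implies (C.21). ∎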
Example C.3 Applying Chebyshev’s inequality
Suppose we roll a pair of fair dice $n$ times. Can we give a good estimate of the total
value of the $n$ rolls? If we compute the mean and variance, we find $\mu = 7n$ and
$\sigma^2 = \frac{35}{6} n$ (we leave it to the reader to verify these expressions). Thus, applying
Chebyshev’s inequality, we see that the final sum will lie within $7n \pm 10 \sqrt{\frac{35}{6} n}$ in
at least 99 percent of all experiments. Therefore, the odds are better than 99 to 1
that the sum will be between 6.975M and 7.025M after 1M rolls.
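The figures in this example can be verified directly. The sketch below enumerates the 36 outcomes of a pair of fair dice to obtain the per-roll mean and variance exactly, then evaluates the Chebyshev interval with $t = 10$ for one million rolls; the variable names are illustrative assumptions.

```python
import math
from fractions import Fraction

# All 36 equally likely sums of a pair of fair dice.
outcomes = [a + b for a in range(1, 7) for b in range(1, 7)]

mean_one = Fraction(sum(outcomes), 36)                                # 7 per roll
var_one = Fraction(sum(s * s for s in outcomes), 36) - mean_one ** 2  # 35/6 per roll

n = 10 ** 6   # number of rolls
t = 10        # Chebyshev parameter: failure probability at most 1 / t^2

# Independence across rolls: mean and variance of the total scale linearly in n.
mean_total = mean_one * n
half_width = t * math.sqrt(float(var_one * n))

lower = float(mean_total) - half_width
upper = float(mean_total) + half_width
confidence = 1.0 - 1.0 / t ** 2   # at least 0.99
```

The resulting half-width is about 24,152, so the interval sits inside the range from 6.975M to 7.025M quoted in the example.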


Definition C.12 Covariance
The covariance of two random variables $X$ and $Y$ is denoted by $\mathrm{Cov}(X, Y)$ and
defined by

$$\mathrm{Cov}(X, Y) = \mathrm{E}\big[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y]) \big]. \tag{C.22}$$

It is straightforward to see that if two random variables $X$ and $Y$ are independent,
then $\mathrm{Cov}(X, Y) = 0$; the converse does not hold in general. The covariance defines a
positive semidefinite and symmetric bilinear form:

- symmetry: $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$ for any two random variables $X$ and $Y$;

- bilinearity: $\mathrm{Cov}(X + X', Y) = \mathrm{Cov}(X, Y) + \mathrm{Cov}(X', Y)$ and $\mathrm{Cov}(aX, Y) = a \mathrm{Cov}(X, Y)$ for any random variables $X$, $X'$, and $Y$ and $a \in \mathbb{R}$;

- positive semidefiniteness: $\mathrm{Cov}(X, X) = \mathrm{Var}[X] \geq 0$ for any random variable $X$.


The following Cauchy-Schwarz inequality holds for random variables $X$ and $Y$ with
$\mathrm{Var}[X] < +\infty$ and $\mathrm{Var}[Y] < +\infty$:

$$|\mathrm{Cov}(X, Y)| \leq \sqrt{\mathrm{Var}[X] \mathrm{Var}[Y]}. \tag{C.23}$$
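The covariance of a finite joint distribution can be computed exactly from definition (C.22). The sketch below uses a hypothetical pair of variables, $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, chosen for illustration: their covariance is zero and the Cauchy-Schwarz inequality holds, yet $Y$ is a function of $X$, so the two variables are not independent.

```python
from fractions import Fraction

# Joint distribution of (X, Y) with X uniform on {-1, 0, 1} and Y = X^2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def expectation(g):
    """E[g(X, Y)] under the joint distribution above."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


ex = expectation(lambda x, y: x)   # E[X] = 0
ey = expectation(lambda x, y: y)   # E[Y] = 2/3

cov_xy = expectation(lambda x, y: (x - ex) * (y - ey))
var_x = expectation(lambda x, y: (x - ex) ** 2)
var_y = expectation(lambda x, y: (y - ey) ** 2)

# Cauchy-Schwarz in squared form: Cov(X, Y)^2 <= Var[X] Var[Y].
cauchy_schwarz = cov_xy ** 2 <= var_x * var_y

# Independence would require Pr[X = 0, Y = 0] = Pr[X = 0] Pr[Y = 0].
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
independent = p_joint == p_x0 * p_y0
```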


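The following definition extends the notion of covariance to a vector of random variables.


Definition C.13
The covariance matrix of a vector of random variables $\mathbf{X} = (X_1, \ldots, X_N)$ is the
matrix in $\mathbb{R}^{N \times N}$ denoted by $\mathbf{C}(\mathbf{X})$ and defined by

$$\mathbf{C}(\mathbf{X}) = \mathrm{E}\big[ (\mathbf{X} - \mathrm{E}[\mathbf{X}])(\mathbf{X} - \mathrm{E}[\mathbf{X}])^\top \big]. \tag{C.24}$$

Thus, $\mathbf{C}(\mathbf{X}) = (\mathrm{Cov}(X_i, X_j))_{ij}$. It is straightforward to show that

$$\mathbf{C}(\mathbf{X}) = \mathrm{E}[\mathbf{X} \mathbf{X}^\top] - \mathrm{E}[\mathbf{X}] \mathrm{E}[\mathbf{X}]^\top. \tag{C.25}$$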
We close this appendix with the following well-known theorem of probability. 


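Theorem C.4 Central limit theorem
Let $X_1, \ldots, X_n$ be a sequence of i.i.d. random variables with mean $\mu$ and standard
deviation $\sigma$. Let $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $\overline{\sigma}_n^2 = \sigma^2 / n$. Then, $(\overline{X}_n - \mu) / \overline{\sigma}_n$ converges
to the $N(0, 1)$ in distribution, that is for any $t \in \mathbb{R}$,

$$\lim_{n \to \infty} \Pr\big[ (\overline{X}_n - \mu) / \overline{\sigma}_n \leq t \big] = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\, dx.$$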


Appendix D Concentration inequalities 


In this appendix, we present several concentration inequalities used in the proofs 
given in this book. Concentration inequalities give probability bounds for a random 
variable to be concentrated around its mean, or for it to deviate from its mean or 
some other value. 


D.1  Hoeffding’s inequality 


We first present Hoeffding’s inequality, whose proof makes use of the general
Chernoff bounding technique. Given a random variable $X$ and $\epsilon > 0$, this technique
consists of proceeding as follows to bound $\Pr[X \geq \epsilon]$. For any $t > 0$, first Markov’s
inequality is used to bound $\Pr[X \geq \epsilon]$:

$$\Pr[X \geq \epsilon] = \Pr[e^{tX} \geq e^{t\epsilon}] \leq e^{-t\epsilon}\, \mathrm{E}[e^{tX}]. \tag{D.1}$$

Then, an upper bound $g(t)$ is found for $\mathrm{E}[e^{tX}]$ and $t$ is selected to minimize $e^{-t\epsilon} g(t)$.
For Hoeffding’s inequality, the following lemma provides an upper bound on $\mathrm{E}[e^{tX}]$.


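Lemma D.1 Hoeffding’s lemma
Let $X$ be a random variable with $\mathrm{E}[X] = 0$ and $a \leq X \leq b$ with $b > a$. Then, for
any $t > 0$, the following inequality holds:

$$\mathrm{E}[e^{tX}] \leq e^{\frac{t^2 (b - a)^2}{8}}. \tag{D.2}$$

Proof By the convexity of $x \mapsto e^x$, for all $x \in [a, b]$, the following holds:

$$e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$$

Thus, using $\mathrm{E}[X] = 0$,

$$\mathrm{E}[e^{tX}] \leq \mathrm{E}\Big[ \frac{b - X}{b - a} e^{ta} + \frac{X - a}{b - a} e^{tb} \Big] = \frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb} = e^{\phi(t)},$$

where,

$$\phi(t) = \log\Big( \frac{b}{b - a} e^{ta} + \frac{-a}{b - a} e^{tb} \Big) = ta + \log\Big( \frac{b}{b - a} + \frac{-a}{b - a} e^{t(b - a)} \Big).$$

For any $t > 0$, the first and second derivative of $\phi$ are given below:

$$\phi'(t) = a - \frac{a e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}} = a - \frac{a}{\frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a}},$$

$$\begin{aligned}
\phi''(t) &= \frac{-ab\, e^{-t(b - a)}}{\big[ \frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a} \big]^2} \\
&= \frac{\alpha (1 - \alpha) e^{-t(b - a)} (b - a)^2}{\big[ (1 - \alpha) e^{-t(b - a)} + \alpha \big]^2} \\
&= \frac{\alpha}{(1 - \alpha) e^{-t(b - a)} + \alpha} \cdot \frac{(1 - \alpha) e^{-t(b - a)}}{(1 - \alpha) e^{-t(b - a)} + \alpha} \, (b - a)^2,
\end{aligned}$$

where $\alpha$ denotes $\frac{-a}{b - a}$. Note that $\phi(0) = \phi'(0) = 0$ and that $\phi''(t) = u (1 - u) (b - a)^2$
where $u = \frac{\alpha}{(1 - \alpha) e^{-t(b - a)} + \alpha}$. Since $u$ is in $[0, 1]$, $u (1 - u)$ is upper bounded by $1/4$
and $\phi''(t) \leq \frac{(b - a)^2}{4}$. Thus, by the second order expansion of function $\phi$, there exists
$\theta \in [0, t]$ such that:

$$\phi(t) = \phi(0) + t \phi'(0) + \frac{t^2}{2} \phi''(\theta) \leq t^2 \frac{(b - a)^2}{8}, \tag{D.3}$$

which completes the proof. ∎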
The lemma can be used to prove the following result known as Hoeffding’s inequality. 


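Theorem D.1 Hoeffding’s inequality 
Let $X_1, \ldots, X_m$ be independent random variables with $X_i$ taking values in $[a_i, b_i]$ for 
all $i \in [1, m]$. Then for any $\epsilon > 0$, the following inequalities hold for $S_m = \sum_{i=1}^m X_i$: 

$$\Pr[S_m - \mathrm{E}[S_m] \geq \epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}$$ (D.4) 

$$\Pr[S_m - \mathrm{E}[S_m] \leq -\epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}.$$ (D.5) 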


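Proof Using the Chernoff bounding technique and lemma D.1, we can write: 


$$\begin{aligned} \Pr[S_m - \mathrm{E}[S_m] \geq \epsilon] &\leq e^{-t\epsilon} \, \mathrm{E}[e^{t(S_m - \mathrm{E}[S_m])}] \\ &= e^{-t\epsilon} \prod_{i=1}^m \mathrm{E}[e^{t(X_i - \mathrm{E}[X_i])}] && \text{(independence of $X_i$s)} \\ &\leq e^{-t\epsilon} \prod_{i=1}^m e^{t^2 (b_i - a_i)^2 / 8} && \text{(lemma D.1)} \\ &= e^{-t\epsilon} e^{t^2 \sum_{i=1}^m (b_i - a_i)^2 / 8} \\ &\leq e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}, \end{aligned}$$ 

where we chose $t = 4\epsilon / \sum_{i=1}^m (b_i - a_i)^2$ to minimize the upper bound. This proves 
the first statement of the theorem, and the second statement is shown in a similar 
way. □ 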
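
The bound (D.4) can be compared with a simulated tail probability. The sketch below is only an illustration under stated assumptions: the $X_i$ are taken uniform on $[0, 1]$ (so $b_i - a_i = 1$ and $\mathrm{E}[S_m] = m/2$), and the function names, the seed, and the number of trials are arbitrary choices.

```python
import math
import random

def hoeffding_bound(eps, widths):
    """Right-hand side of (D.4) for interval widths b_i - a_i."""
    return math.exp(-2.0 * eps ** 2 / sum(w * w for w in widths))

def empirical_upper_tail(m, eps, trials, seed=0):
    """Fraction of trials with S_m - E[S_m] >= eps for X_i uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(m))
        if s - m / 2.0 >= eps:
            hits += 1
    return hits / trials

m, eps = 100, 10.0
print(empirical_upper_tail(m, eps, trials=2000), hoeffding_bound(eps, [1.0] * m))
```

For these parameters the bound equals $e^{-2}$, and the simulated frequency is typically far smaller, since Hoeffding’s inequality uses only the range of the variables and not their variance.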


When the variance $\sigma^2_{X_i}$ of each random variable $X_i$ is known and the $\sigma^2_{X_i}$s are 
relatively small, better concentration bounds can be derived (see Bennett’s and 
Bernstein’s inequalities proven in exercise D.4). 


D.2 McDiarmid’s inequality 


This section presents a concentration inequality that is more general than 
Hoeffding’s inequality. Its proof makes use of a version of Hoeffding’s inequality 
for martingale differences. 


Definition D.1 Martingale Difference 
A sequence of random variables $V_1, V_2, \ldots$ is a martingale difference sequence with 
respect to $X_1, X_2, \ldots$ if for all $i > 0$, $V_i$ is a function of $X_1, \ldots, X_i$ and 


$$\mathrm{E}[V_{i+1} | X_1, \ldots, X_i] = 0.$$ (D.6) 
The following result is similar to Hoeffding’s lemma. 


Lemma D.2 
Let $V$ and $Z$ be random variables satisfying $\mathrm{E}[V | Z] = 0$ and, for some function $f$ 
and constant $c \geq 0$, the inequalities: 


$$f(Z) \leq V \leq f(Z) + c.$$ (D.7) 

Then, for all $t > 0$, the following upper bound holds: 

$$\mathrm{E}[e^{tV} | Z] \leq e^{t^2 c^2 / 8}.$$ (D.8) 


Proof The proof follows using the same steps as in that of lemma D.1 with 
conditional expectations used instead of expectations: conditioned on $Z$, $V$ takes 
values in $[a, b]$ with $a = f(Z)$ and $b = f(Z) + c$ and its expectation vanishes. □ 


The lemma is used to prove the following theorem, which is one of the main results 
of this section. 


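Theorem D.2 Azuma’s inequality 
Let $V_1, V_2, \ldots$ be a martingale difference sequence with respect to the random 
variables $X_1, X_2, \ldots$, and assume that for all $i > 0$ there is a constant $c_i \geq 0$ and 
random variable $Z_i$, which is a function of $X_1, \ldots, X_{i-1}$, that satisfy 


$$Z_i \leq V_i \leq Z_i + c_i.$$ (D.9) 

Then, for all $\epsilon > 0$ and $m$, the following inequalities hold: 


$$\Pr\Big[\sum_{i=1}^m V_i \geq \epsilon\Big] \leq \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big)$$ (D.10) 

$$\Pr\Big[\sum_{i=1}^m V_i \leq -\epsilon\Big] \leq \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big).$$ (D.11) 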


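Proof For any $k \in [1, m]$, let $S_k = \sum_{i=1}^k V_i$. Then, using Chernoff’s bounding 
technique, for any $t > 0$, we can write 


$$\begin{aligned} \Pr[S_m \geq \epsilon] &\leq e^{-t\epsilon} \, \mathrm{E}[e^{t S_m}] \\ &= e^{-t\epsilon} \, \mathrm{E}\big[e^{t S_{m-1}} \, \mathrm{E}[e^{t V_m} | X_1, \ldots, X_{m-1}]\big] \\ &\leq e^{-t\epsilon} \, \mathrm{E}[e^{t S_{m-1}}] \, e^{t^2 c_m^2 / 8} && \text{(lemma D.2)} \\ &\leq e^{-t\epsilon} e^{t^2 \sum_{i=1}^m c_i^2 / 8} && \text{(iterating previous argument)} \\ &= e^{-2\epsilon^2 / \sum_{i=1}^m c_i^2}, \end{aligned}$$ 


where we chose $t = 4\epsilon / \sum_{i=1}^m c_i^2$ to minimize the upper bound. This proves the first 
statement of the theorem, and the second statement is shown in a similar way. □ 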
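
The choice of $t$ in the last step can be checked directly. The short derivation below works through that step; the symbols $g$ and $C$ are shorthand introduced here for illustration and are not used elsewhere in the text.

```latex
\begin{align*}
g(t) &= -t\epsilon + \frac{t^2 C}{8}, \qquad C = \sum_{i=1}^m c_i^2, \\
g'(t) &= -\epsilon + \frac{tC}{4} = 0 \iff t = \frac{4\epsilon}{C}, \\
g\Big(\frac{4\epsilon}{C}\Big) &= -\frac{4\epsilon^2}{C} + \frac{16\epsilon^2}{8C} = -\frac{2\epsilon^2}{C}.
\end{align*}
```

Since $g$ is a convex quadratic in $t$, this stationary point is its minimizer. The same computation with $C = \sum_{i=1}^m (b_i - a_i)^2$ gives the value of $t$ used in the proof of theorem D.1.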


The following is the second main result of this section. Its proof makes use of 
Azuma’s inequality. 


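Theorem D.3 McDiarmid’s inequality 

Let $X_1, \ldots, X_m \in \mathcal{X}^m$ be a set of $m \geq 1$ independent random variables and 
assume that there exist $c_1, \ldots, c_m > 0$ such that $f \colon \mathcal{X}^m \to \mathbb{R}$ satisfies the following 
conditions: 


$$\big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i,$$ (D.12) 


for all $i \in [1, m]$ and any points $x_1, \ldots, x_m, x_i' \in \mathcal{X}$. Let $f(S)$ denote $f(X_1, \ldots, X_m)$, 
then, for all $\epsilon > 0$, the following inequalities hold: 


$$\Pr[f(S) - \mathrm{E}[f(S)] \geq \epsilon] \leq \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big)$$ (D.13) 

$$\Pr[f(S) - \mathrm{E}[f(S)] \leq -\epsilon] \leq \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big).$$ (D.14) 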


Proof Define a sequence of random variables $V_k$, $k \in [1, m]$, as follows: $V = 
f(S) - \mathrm{E}[f(S)]$, $V_1 = \mathrm{E}[V | X_1] - \mathrm{E}[V]$, and for $k > 1$, 

$$V_k = \mathrm{E}[V | X_1, \ldots, X_k] - \mathrm{E}[V | X_1, \ldots, X_{k-1}].$$ 


Note that $V = \sum_{k=1}^m V_k$. Furthermore, the random variable $\mathrm{E}[V | X_1, \ldots, X_k]$ is a 
function of $X_1, \ldots, X_k$. Conditioning on $X_1, \ldots, X_{k-1}$ and taking its expectation is 
therefore: 


$$\mathrm{E}\big[\mathrm{E}[V | X_1, \ldots, X_k] \,\big|\, X_1, \ldots, X_{k-1}\big] = \mathrm{E}[V | X_1, \ldots, X_{k-1}],$$ 


which implies $\mathrm{E}[V_k | X_1, \ldots, X_{k-1}] = 0$. Thus, the sequence $(V_k)_{k \in [1, m]}$ is a 
martingale difference sequence. Next, observe that, since $\mathrm{E}[f(S)]$ is a scalar, $V_k$ can be 
expressed as follows: 


$$V_k = \mathrm{E}[f(S) | X_1, \ldots, X_k] - \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}].$$ 

Thus, we can define an upper bound $W_k$ and lower bound $U_k$ for $V_k$ by: 


$$W_k = \sup_x \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}, x] - \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}]$$ 

$$U_k = \inf_x \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}, x] - \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}].$$ 

Now, by (D.12), for any $k \in [1, m]$, the following holds: 


$$W_k - U_k = \sup_{x, x'} \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}, x] - \mathrm{E}[f(S) | X_1, \ldots, X_{k-1}, x'] \leq c_k,$$ (D.15) 

thus, $U_k \leq V_k \leq U_k + c_k$. In view of these inequalities, we can apply Azuma’s 
inequality to $V = \sum_{k=1}^m V_k$, which yields exactly (D.13) and (D.14). □ 


McDiarmid’s inequality is used in several of the proofs in this book. It can be 
understood in terms of stability: if changing any of its arguments affects $f$ only in a 
limited way, then its deviations from its mean can be exponentially bounded. Note 
also that Hoeffding’s inequality (theorem D.1) is a special instance of McDiarmid’s 
inequality where $f$ is defined by $f \colon (x_1, \ldots, x_m) \mapsto \sum_{i=1}^m x_i$, in which case (D.12) holds with $c_i = b_i - a_i$. 
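
To illustrate the stability reading of theorem D.3 on a function that is not a sum, the sketch below takes $f$ to be the number of distinct values among $m$ independent uniform draws from $\{0, \ldots, n-1\}$. Changing a single draw changes this count by at most one, so (D.12) holds with $c_i = 1$. The choice of $f$, the function names, and the simulation parameters are assumptions made for illustration only.

```python
import math
import random

def mcdiarmid_bound(eps, c):
    """Right-hand side of (D.13) for bounded-difference constants c_1, ..., c_m."""
    return math.exp(-2.0 * eps ** 2 / sum(ci * ci for ci in c))

def expected_distinct(m, n):
    """Exact E[f(S)] when f counts distinct values among m uniform draws from n values."""
    return n * (1.0 - (1.0 - 1.0 / n) ** m)

def empirical_upper_tail(m, n, eps, trials, seed=0):
    """Fraction of trials with f(S) - E[f(S)] >= eps."""
    rng = random.Random(seed)
    mean = expected_distinct(m, n)
    hits = 0
    for _ in range(trials):
        distinct = len({rng.randrange(n) for _ in range(m)})
        if distinct - mean >= eps:
            hits += 1
    return hits / trials

m, n, eps = 50, 50, 5.0
print(empirical_upper_tail(m, n, eps, trials=2000), mcdiarmid_bound(eps, [1.0] * m))
```

The simulated frequency is expected to fall well below the bound $e^{-1}$ obtained for these parameters; the bound depends on $f$ only through the constants $c_i$.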


D.3 Other inequalities 


This section presents several other inequalities useful in the proofs of various results 
presented in this book. 

D.3.1 Binomial distribution: Slud’s inequality 


Let $B$ be a binomial random variable $B(m, p)$ and $k$ an integer such that $p \leq \frac{1}{4}$ and 
$k \geq mp$, or $p \leq \frac{1}{2}$ and $mp \leq k \leq m(1 - p)$. Then, the following inequality holds: 


$$\Pr[B \geq k] \geq \Pr\Big[N \geq \frac{k - mp}{\sqrt{mp(1 - p)}}\Big],$$ (D.16) 


where $N$ is a standard normal random variable. 
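
The inequality (D.16) can be checked exactly for small $m$. The sketch below computes the binomial tail by direct summation and the normal tail through the complementary error function; the function names and the parameter values are illustrative assumptions.

```python
import math

def binomial_upper_tail(m, p, k):
    """Exact Pr[B >= k] for B ~ Binomial(m, p)."""
    return sum(math.comb(m, j) * p ** j * (1.0 - p) ** (m - j)
               for j in range(k, m + 1))

def normal_upper_tail(z):
    """Pr[N >= z] for a standard normal N, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def slud_lower_bound(m, p, k):
    """Right-hand side of (D.16)."""
    return normal_upper_tail((k - m * p) / math.sqrt(m * p * (1.0 - p)))

m, p = 20, 0.2
for k in range(4, 11):  # first condition of Slud's inequality: p <= 1/4 and k >= mp = 4
    print(k, binomial_upper_tail(m, p, k), slud_lower_bound(m, p, k))
```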

D.3.2 Normal distribution: tail bound 

If $N$ is a random variable following the standard normal distribution, then for $u > 0$, 

$$\Pr[N \geq u] \geq \frac{1}{2} \Big(1 - \sqrt{1 - e^{-u^2}}\Big).$$ (D.17) 
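
A quick numerical comparison of the two sides of (D.17), written as a sketch with illustrative function names, is given below. The exact tail is obtained from the complementary error function.

```python
import math

def normal_upper_tail(u):
    """Pr[N >= u] for a standard normal N."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def tail_lower_bound(u):
    """Right-hand side of (D.17)."""
    return 0.5 * (1.0 - math.sqrt(1.0 - math.exp(-u * u)))

for u in (0.1, 0.5, 1.0, 2.0, 3.0):
    print(u, normal_upper_tail(u), tail_lower_bound(u))
```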

D.3.3. Khintchine-Kahane inequality 


The following inequality is useful in a variety of different contexts, including in the 
proof of a lower bound for the empirical Rademacher complexity of linear hypotheses 
(chapter 5). 


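Theorem D.4 Khintchine-Kahane inequality 
Let $(\mathbb{H}, \| \cdot \|)$ be a normed vector space and let $\mathbf{x}_1, \ldots, \mathbf{x}_m$ be $m \geq 1$ elements of 
$\mathbb{H}$. Let $\sigma = (\sigma_1, \ldots, \sigma_m)^\top$ with $\sigma_i$s independent uniform random variables taking 
values in $\{-1, +1\}$ (Rademacher variables). Then, the following inequalities hold: 


$$\frac{1}{2} \mathrm{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|^2\Big] \leq \Big(\mathrm{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|\Big]\Big)^2 \leq \mathrm{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|^2\Big].$$ (D.18) 


Proof The second inequality is a direct consequence of the convexity of $x \mapsto x^2$ 
and Jensen’s inequality (theorem B.4). 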

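To prove the left-hand side inequality, first note that for any $\beta_1, \ldots, \beta_m \in \mathbb{R}$, 
expanding the product $\prod_{i=1}^m (1 + \beta_i)$ leads exactly to the sum of all monomials 
$\beta_1^{\delta_1} \cdots \beta_m^{\delta_m}$, with exponents $\delta_1, \ldots, \delta_m$ in $\{0, 1\}$. We will use the notation 
$\beta_1^{\delta_1} \cdots \beta_m^{\delta_m} = \beta^\delta$ and $|\delta| = \sum_{i=1}^m \delta_i$ for any $\delta = (\delta_1, \ldots, \delta_m) \in \{0, 1\}^m$. In view of 
that, for any $(\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^m$ and $t > 0$, the following equality holds: 


$$t^2 \prod_{i=1}^m (1 + \alpha_i / t) = t^2 \sum_{\delta \in \{0,1\}^m} \alpha^\delta / t^{|\delta|} = \sum_{\delta \in \{0,1\}^m} t^{2 - |\delta|} \alpha^\delta.$$ 


Differentiating both sides with respect to $t$ and setting $t = 1$ yields 


$$2 \prod_{i=1}^m (1 + \alpha_i) - \sum_{j=1}^m \alpha_j \prod_{i \neq j} (1 + \alpha_i) = \sum_{\delta \in \{0,1\}^m} (2 - |\delta|) \alpha^\delta.$$ (D.19) 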


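For any $\sigma \in \{-1, +1\}^m$, let $S_\sigma$ be defined by $S_\sigma = \|\mathbf{s}_\sigma\|$ with $\mathbf{s}_\sigma = \sum_{i=1}^m \sigma_i \mathbf{x}_i$. 
Then, setting $\alpha_i = \sigma_i \sigma_i'$, multiplying both sides of (D.19) by $S_\sigma S_{\sigma'}$, and taking the 
sum over all $\sigma, \sigma' \in \{-1, +1\}^m$ yields 


$$\begin{aligned} & \sum_{\sigma, \sigma' \in \{-1,+1\}^m} \Big(2 \prod_{i=1}^m (1 + \sigma_i \sigma_i') - \sum_{j=1}^m \sigma_j \sigma_j' \prod_{i \neq j} (1 + \sigma_i \sigma_i')\Big) S_\sigma S_{\sigma'} \\ &\quad = \sum_{\sigma, \sigma' \in \{-1,+1\}^m} \sum_{\delta \in \{0,1\}^m} (2 - |\delta|) \sigma^\delta \sigma'^\delta S_\sigma S_{\sigma'} \\ &\quad = \sum_{\delta \in \{0,1\}^m} (2 - |\delta|) \sum_{\sigma, \sigma' \in \{-1,+1\}^m} \sigma^\delta \sigma'^\delta S_\sigma S_{\sigma'} \\ &\quad = \sum_{\delta \in \{0,1\}^m} (2 - |\delta|) \Big[\sum_{\sigma \in \{-1,+1\}^m} \sigma^\delta S_\sigma\Big]^2. \end{aligned}$$ (D.20) 


Note that the terms of the right-hand sum with $|\delta| \geq 2$ are non-positive. The terms 
with $|\delta| = 1$ are null: since $S_\sigma = S_{-\sigma}$, we have $\sum_{\sigma \in \{-1,+1\}^m} \sigma^\delta S_\sigma = 0$ in that case. 
Thus, the right-hand side can be upper bounded by the term with $\delta = 0$, that is, 
$2 \big(\sum_{\sigma \in \{-1,+1\}^m} S_\sigma\big)^2$. The left-hand side of (D.20) can be rewritten as follows: 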


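$$\begin{aligned} & \sum_{\sigma \in \{-1,+1\}^m} (2^{m+1} - m 2^{m-1}) S_\sigma^2 + 2^{m-1} \sum_{\sigma \in \{-1,+1\}^m} \sum_{\sigma' \in B(\sigma, 1)} S_\sigma S_{\sigma'} \\ &\quad = 2^m \sum_{\sigma \in \{-1,+1\}^m} S_\sigma^2 + 2^{m-1} \sum_{\sigma \in \{-1,+1\}^m} S_\sigma \Big(\sum_{\sigma' \in B(\sigma, 1)} S_{\sigma'} - (m - 2) S_\sigma\Big), \end{aligned}$$ (D.21) 


where $B(\sigma, 1)$ denotes the set of $\sigma'$ that differ from $\sigma$ in exactly one coordinate 
$j \in [1, m]$, that is the set of $\sigma'$ with Hamming distance one from $\sigma$. Note that for any 
such $\sigma'$, $\mathbf{s}_\sigma - \mathbf{s}_{\sigma'} = 2 \sigma_j \mathbf{x}_j$ for one coordinate $j \in [1, m]$, thus, $\sum_{\sigma' \in B(\sigma, 1)} (\mathbf{s}_\sigma - \mathbf{s}_{\sigma'}) = 
2 \mathbf{s}_\sigma$. In light of that and using the triangle inequality, we can write 


$$\begin{aligned} (m - 2) S_\sigma = \|m \mathbf{s}_\sigma\| - \|2 \mathbf{s}_\sigma\| &= \Big\|\sum_{\sigma' \in B(\sigma, 1)} \mathbf{s}_\sigma\Big\| - \Big\|\sum_{\sigma' \in B(\sigma, 1)} (\mathbf{s}_\sigma - \mathbf{s}_{\sigma'})\Big\| \\ &\leq \Big\|\sum_{\sigma' \in B(\sigma, 1)} \mathbf{s}_{\sigma'}\Big\| \leq \sum_{\sigma' \in B(\sigma, 1)} S_{\sigma'}. \end{aligned}$$ 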


Thus, the second sum of (D.21) is non-negative and the left-hand side of (D.20) can 
be lower bounded by the first sum $2^m \sum_{\sigma \in \{-1,+1\}^m} S_\sigma^2$. Combining this with the 
upper bound found for (D.20) gives 


$$2^m \sum_{\sigma \in \{-1,+1\}^m} S_\sigma^2 \leq 2 \Big[\sum_{\sigma \in \{-1,+1\}^m} S_\sigma\Big]^2.$$ 


Dividing both sides by $2^{2m}$ and using $\Pr[\sigma] = 1/2^m$ gives $\mathrm{E}_\sigma[S_\sigma^2] \leq 2 (\mathrm{E}_\sigma[S_\sigma])^2$ and 
completes the proof. □ 


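The constant $1/2$ appearing in the first inequality of (D.18) is optimal. To see this, 
consider the case where $m = 2$ and $\mathbf{x}_1 = \mathbf{x}_2 = \mathbf{x}$ for some non-zero vector $\mathbf{x} \in \mathbb{H}$. 
Then, the left-hand side of the first inequality is $\frac{1}{2} \sum_{i=1}^m \|\mathbf{x}_i\|^2 = \|\mathbf{x}\|^2$ and the 
right-hand side $(\mathrm{E}_\sigma[\|(\sigma_1 + \sigma_2) \mathbf{x}\|])^2 = \|\mathbf{x}\|^2 (\mathrm{E}_\sigma[|\sigma_1 + \sigma_2|])^2 = \|\mathbf{x}\|^2$. 

Note that when the norm $\| \cdot \|$ corresponds to an inner product, as in the case of 
a Hilbert space $\mathbb{H}$, we can write 


$$\mathrm{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|^2\Big] = \mathrm{E}_\sigma\Big[\sum_{i,j=1}^m \sigma_i \sigma_j (\mathbf{x}_i \cdot \mathbf{x}_j)\Big] = \sum_{i,j=1}^m \mathrm{E}_\sigma[\sigma_i \sigma_j] (\mathbf{x}_i \cdot \mathbf{x}_j) = \sum_{i=1}^m \|\mathbf{x}_i\|^2,$$ 


since by the independence of the random variables $\sigma_i$, for $i \neq j$, $\mathrm{E}_\sigma[\sigma_i \sigma_j] = 
\mathrm{E}_\sigma[\sigma_i] \mathrm{E}_\sigma[\sigma_j] = 0$. Thus, (D.18) can then be rewritten as follows: 


$$\frac{1}{2} \sum_{i=1}^m \|\mathbf{x}_i\|^2 \leq \Big(\mathrm{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i \mathbf{x}_i\Big\|\Big]\Big)^2 \leq \sum_{i=1}^m \|\mathbf{x}_i\|^2.$$ (D.22) 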
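
For small $m$, both expectations in (D.18) can be computed exactly by enumerating all $2^m$ sign vectors. The sketch below does this for vectors in $\mathbb{R}^2$ under the $L_1$ and $L_2$ norms; the sample vectors and the function names are arbitrary illustrative choices.

```python
import itertools
import math

def rademacher_moments(xs, norm):
    """Return (E[||sum_i s_i x_i||], E[||sum_i s_i x_i||^2]) by exact enumeration."""
    m, d = len(xs), len(xs[0])
    first = second = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        v = [sum(s * x[j] for s, x in zip(signs, xs)) for j in range(d)]
        r = norm(v)
        first += r
        second += r * r
    count = 2 ** m
    return first / count, second / count

def l1_norm(v):
    return sum(abs(c) for c in v)

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

xs = [(1.0, 2.0), (-0.5, 3.0), (4.0, 0.0), (1.5, -1.0)]
for norm in (l1_norm, l2_norm):
    e1, e2 = rademacher_moments(xs, norm)
    # The three printed values should be non-decreasing, as stated by (D.18).
    print(norm.__name__, 0.5 * e2, e1 ** 2, e2)
```

With the $L_2$ norm, the second moment coincides with $\sum_i \|\mathbf{x}_i\|^2$, as in the Hilbert space computation above. Calling `rademacher_moments([(1.0, 0.0), (1.0, 0.0)], l2_norm)` reproduces the case $m = 2$, $\mathbf{x}_1 = \mathbf{x}_2$, for which the first inequality is tight.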


D.4 Chapter notes 


The improved version of Azuma’s inequality [Hoeffding, 1963, Azuma, 1967] 
presented in this chapter is due to McDiarmid [1989]. The improvement is a reduction 
of the exponent by a factor of 4. This also appears in McDiarmid’s inequality, which 
is derived from the inequality for bounded martingale sequences. The inequalities 
presented in exercise D.4 are due to Bernstein [1927] and Bennett [1962]; the exercise 
is from Devroye and Lugosi [1995]. 

The binomial inequality of section D.3.1 is due to Slud [1977]. The tail bound 
of section D.3.2 is due to Tate [1953] (see also Anthony and Bartlett [1999]). The 
Khintchine-Kahane inequality was first studied in the case of real-valued variables 
$x_1, \ldots, x_m$ by Khintchine [1923], with better constants and simpler proofs later 
provided by Szarek [1976], Haagerup [1982], and Tomaszewski [1982]. The inequality 
was extended to normed vector spaces by Kahane [1964]. The proof presented here 
is due to Latała and Oleszkiewicz [1994] and provides the best possible constants. 


D.5 Exercises 


D.1 Twins paradox. Professor Mamoru teaches at a university whose computer 
science and math building has $F = 30$ floors. 


(1) Assume that the floors are independent and that they have the same 
probability to be selected by someone taking the elevator. How many people 
should take the elevator in order to make it likely (probability more than half) 
that two of them go to the same floor? (Hint: use the Taylor series expansion of 
$e^{-x} = 1 - x + \cdots$ and give an approximate general expression of the solution.) 


(2) Professor Mamoru is popular, and his floor is in fact more likely to be 
selected than others. Assuming that all other floors are equiprobable, derive 
the general expression of the probability that two persons go to the same floor, 
using the same approximation as before. How many people should take the 
elevator in order to make it likely that two of them go to the same floor when 
the probability of Professor Mamoru’s floor is .25, .35, or .5? When q = .5, 
would the answer change if the number of floors were instead F = 1,000? 


(3) The probability models assumed in (1) and (2) are both naive. If you had 
access to the data collected by the elevator guard, how would you define a more 
faithful model? 


D.2 Concentration bounds. Let $X$ be a non-negative random variable satisfying 
$\Pr[X > t] \leq c e^{-2mt^2}$ for all $t > 0$ and some $c > 0$. Show that $\mathrm{E}[X^2] \leq \frac{\log(ce)}{2m}$ (Hint: 
to do that, use the identity $\mathrm{E}[X^2] = \int_0^\infty \Pr[X^2 > t] \, dt$, write $\int_0^\infty = \int_0^u + \int_u^\infty$, bound 
the first term by $u$ and find the best $u$ to minimize the upper bound). 


D.3 Comparison of Hoeffding’s and Chebyshev’s inequalities. Let $X_1, \ldots, X_m$ be 
a sequence of random variables taking values in $[0, 1]$ with the same mean $\mu$ and 
variance $\sigma^2 < \infty$ and let $\overline{X} = \frac{1}{m} \sum_{i=1}^m X_i$. 


(a) For any $\epsilon > 0$, give a bound on $\Pr[|\overline{X} - \mu| > \epsilon]$ using Chebyshev’s inequality, 
then Hoeffding’s inequality. For what values of $\sigma$ is Chebyshev’s inequality 
tighter? 

(b) Assume that the random variables $X_i$ take values in $\{0, 1\}$. Show that 
$\sigma^2 \leq \frac{1}{4}$. Use this to simplify Chebyshev’s inequality. Choose $\epsilon = .05$ and 
plot Chebyshev’s inequality thereby modified and Hoeffding’s inequality as a 
function of $m$ (you can use your preferred program for generating the plots). 


D.4 Bennett’s and Bernstein’s inequalities. The objective of this problem is to prove 
these two inequalities. 


(a) Show that for any $t > 0$, and any random variable $X$ with $\mathrm{E}[X] = 0$, 
$\mathrm{E}[X^2] = \sigma^2$, and $X \leq c$, 


$$\mathrm{E}[e^{tX}] \leq e^{f(\sigma^2 / c^2)},$$ (D.23) 


where 

$$f(x) = \log\Big(\frac{1}{1 + x} e^{-ctx} + \frac{x}{1 + x} e^{ct}\Big).$$ 


(b) Show that $f''(x) \leq 0$ for $x \geq 0$. 
(c) Using Chernoff’s bounding technique, show that 


$$\Pr\Big[\frac{1}{m} \sum_{i=1}^m X_i > \epsilon\Big] \leq e^{-tm\epsilon + \sum_{i=1}^m f(\sigma^2_{X_i} / c^2)},$$ 

where $\sigma^2_{X_i}$ is the variance of $X_i$. 

(d) Show that $f(x) \leq f(0) + x f'(0) = (e^{ct} - 1 - ct) x$. 

(e) Using the bound derived in (d), find the optimal value of $t$. 


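(f) Bennett’s inequality. Let $X_1, \ldots, X_m$ be independent real-valued random 
variables with zero mean such that for $i = 1, \ldots, m$, $X_i \leq c$. Let $\sigma^2 = 
\frac{1}{m} \sum_{i=1}^m \sigma^2_{X_i}$. Show that 


$$\Pr\Big[\frac{1}{m} \sum_{i=1}^m X_i > \epsilon\Big] \leq \exp\Big(-\frac{m \sigma^2}{c^2} \, \theta\Big(\frac{\epsilon c}{\sigma^2}\Big)\Big),$$ (D.24) 


where $\theta(x) = (1 + x) \log(1 + x) - x$. 


(g) Bernstein’s inequality. Show that under the same conditions as Bennett’s 
inequality 


$$\Pr\Big[\frac{1}{m} \sum_{i=1}^m X_i > \epsilon\Big] \leq \exp\Big(-\frac{m \epsilon^2}{2 \sigma^2 + 2 c \epsilon / 3}\Big).$$ (D.25) 


(Hint: show that for all $x \geq 0$, $\theta(x) \geq h(x) = \frac{3}{2} \frac{x^2}{x + 3}$.) 


(h) Write Hoeffding’s inequality assuming the same conditions. For what values 
of $\sigma$ is Bernstein’s inequality better than Hoeffding’s inequality? 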


Appendix E Notation 


Table E.1 Summary of notation. 


$\mathbb{R}$                  Set of real numbers 
$\mathbb{R}_+$                Set of non-negative real numbers 
$\mathbb{R}^n$                Set of n-dimensional real-valued vectors 
$\mathbb{R}^{n \times m}$     Set of n x m real-valued matrices 
$[a, b]$                      Closed interval between a and b 
$(a, b)$                      Open interval between a and b 
$\{a, b, c\}$                 Set containing elements a, b and c 
$\mathbb{N}$                  Set of natural numbers, i.e., {0, 1, ...} 
$\log$                        Logarithm with base e 
$\log_a$                      Logarithm with base a 
$S$                           An arbitrary set 
$|S|$                         Number of elements in S 
$s \in S$                     An element in set S 
$\mathcal{X}$                 Input space 
$\mathcal{Y}$                 Target space 
$\mathbb{H}$                  Feature space 
$\langle \cdot, \cdot \rangle$  Inner product in feature space 
$\mathbf{v}$                  An arbitrary vector 
$\mathbf{1}$                  Vector of all ones 
$v_i$                         ith component of v 
$\|\mathbf{v}\|$              L2 norm of v 
$\|\mathbf{v}\|_p$            Lp norm of v 
$\mathbf{u} \circ \mathbf{v}$ Hadamard or entry-wise product of vectors u and v 
$f \circ g$                   Composition of functions f and g 
$T_1 \circ T_2$               Composition of weighted transducers T1 and T2 
$\mathbf{M}$                  An arbitrary matrix 
$\|\mathbf{M}\|_2$            Spectral norm of M 
$\|\mathbf{M}\|_F$            Frobenius norm of M 
$\mathbf{M}^\top$             Transpose of M 
$\mathbf{M}^\dagger$          Pseudo-inverse of M 
$\mathrm{Tr}[\mathbf{M}]$     Trace of M 
$\mathbf{I}$                  Identity matrix 
$K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$  Kernel function over $\mathcal{X}$ 
$\mathbf{K}$                  Kernel matrix 
$1_A$                         Indicator function indicating membership in subset A 
$R(\cdot)$                    Generalization error or risk 
$\widehat{R}(\cdot)$          Empirical error or risk 
$\mathfrak{R}_m(\cdot)$       Rademacher complexity over all samples of size m 
$\widehat{\mathfrak{R}}_S(\cdot)$  Empirical Rademacher complexity with respect to sample S 
$N(0, 1)$                     Standard normal distribution 
$\mathrm{E}_{x \sim D}[\cdot]$  Expectation over x drawn from distribution D 
$\Sigma^*$                    Kleene closure over a set of characters $\Sigma$ 


expert, 149 

hypothesis, 17-19, 59, 144 

learner, 5 

NFA, 309 

pairwise, 227 


constraint 


L1-, 259 
affine, 66, 68, 72, 73, 191, 253, 260, 
354 
differentiable, 66, 73, 191, 248 
equality, 354 
qualification 
strong, 355 
weak, 355 


context-free 


grammar, 293, 297 
language, 111, 293 


convex, 72, 83, 126, 161, 218, 257, 352, 


353 

d-gon, 44, 45 

combination, 132-134, 192 

constraint, 68, 72, 73 

domain, 351 

function, 51, 66, 72, 73, 126, 128, 
143, 144, 157, 159, 172, 179, 
191, 192, 196, 205, 207, 208, 
218, 219, 224, 246, 248, 256, 
271-273, 349-354, 356 

hull, 42-44, 132, 220, 350 

intersection, 57 

loss, 128, 147, 153, 156, 157, 159, 
172, 175, 181, 219, 256, 271- 
273, 277 

optimization, 9, 65, 66, 68, 72, 84, 
94, 191, 248, 257, 349, 350, 353- 
357 

polygon, 45 

potential, 141, 142 

QP, 66, 74, 253, 255 
region, 195 
set, 350-353 
strictly, 66, 351 
upper bound, 72, 73, 126, 128, 218 
convexity, 36, 53, 72, 91, 158, 161, 173, 
180, 181, 207, 218, 248, 352- 
354, 357, 369, 374 
covariance, 366, 367 
matrix, 282, 283, 287, 290, 367 
covering, 61 
numbers, 55, 61, 233 
CRFs, 205, 207 
cross-validation, 140, 256 
n-fold, 5, 6, 28, 72, 87, 198 
error, 5, 6, 86 
leave-one-out, 6 


data 
set, 2 
test, 4 
training, 3 
unseen, 3 
validation, 3 
DCG, 233, 234 
normalized, 233 
decision epoch, 315, 330, 332, 334 
decision stump, 130, 140, 141 
decision trees, 129, 130, 141, 183, 191, 
194, 195, 197, 198, 206, 208, 
263, 299, 300, 302, 310 
binary, 150, 194 
binary space partition trees, 195 
classification, 299 
learning, 195, 197, 206, see also 
GreedyDecisionTrees algorithm 
node, 194 
question, 194-196, 299 
categorical, 194 
numerical, 194 
sphere trees, 195 


stump, see also boosting stump, see 
decision stump 
DFA, 295, 296, 298-300, 302-304, 309- 
311 
acyclic, 295 
consistent, 296 
equivalent, 295 
learning, 303, 309 
minimum consistent, 296 
learning with queries, 298, 303 
minimal, 295, 296, 298, 310 
minimization, 296 
reverse, 304 
VC-dimension, 311 
dichotomy, 41-46 
differentiable 
function, 349, 351, 352, 356 
upper bound, 126, 128 
dimensionality reduction, 2, 7, 101, 281, 
285, 288, 290 
discounted cumulative gain, see DCG 
distribution, 359, 360 
χ²-squared, 288, 289, 361 
absolutely continuous, 360 
binomial, 360 
chi-squared, 361 
density function, 360 
Gaussian, 360 
Laplace, 361 
normal, 360 
Poisson, 361 
probability, 359 
distribution-free model, 13 
DNF formula, 20, 311 
disjoint, 310 
doubling trick, 155, 158, 174, 175 
dual, 251 
function, 354 
norm, 342 
optimization, 66-68, 74, 75, 83, 84, 
100, 191, 207, 249, 255, 264, 355 
problem, 355 

SVM, 164 

SVR, 262 

variables, 67, 70, 74, 264, 354 
duality 

gap, 355 

strong, 68, 355 

weak, 355 
DualPerceptron, 167, 168 


early stopping, 141 
edge, see classifier edge 
emphasis function, 231, 232, 235 
empirical kernel map, see kernel 
empirical risk minimization, 26, 27, 38 
ensemble 
algorithms, 121 
hypotheses, 133, 220 
margin bound, 133 
methods, 121, 122, 220 
ranking, 220 
envelope, 262 
environment, 1, 8, 313, 314, 326, 336 
MDP, 315 
model, 313, 314, 319, 325, 326, 330 
unknown, 336 
Erdos, 48 
ERM, see empirical risk minimization 
error, 12, see also risk 
approximation, 26 
Bayes, 25 
cross-validation, 5 
empirical, 8, 12, 184, 380 
estimation, 26 
generalization, 8, 12, 380 
leave-one-out, 69 
mean squared, 238 
reconstruction, 282 
test, 5 
training, 5 
true, 12 
event, 30, 118, 119, 359, 361, 362 
elementary, 359 
independent, 361 
indicator, 12 
mutually disjoint, 362 
mutually exclusive, 359 
set, 359 
examples, 3, 11 
i.i.d., 12 
incorrectly labeled, 141 
labeled, 4 
misclassified, 144 
negative, 29 
positive, 19, 303 
unlabeled, 7 
expectation, 363 
linearity, 363 
experience, 1, 336 
expert, 32, 148-154, 156, 157, 168, 169, 
171, 174, 175, 179 
active, 149 
advice, 32, 147, 148 
algorithm, 175 
best, 148, 151, 152, 175 
exploitation, 8, 314 
exploration, 8, 314 
Boltzmann, 334 
exploitation dilemma, 8, 314 
Exponential-Weighted-Average algorithm, 
8, 156, 157, 173, 174 


false negative, 14 

false positive, 14 

error, 87 

rate, 225, 226 
fat-shattered, 244 
fat-shattering, 262 
dimension, 244, 245 
feature, 3 


extraction, 281 
mapping, 96-98, 102, 117, 167, 189, 
190, 214, 247, 252, 254, 255, 
281, 284 
missing, 198 
r, 96 
relevant, 3, 4, 118, 204 
space, 76, 82, 83, 90, 91, 96, 117, 
118, 140, 194, 213, 246, 247, 
251, 310, 379 
vector, 4 
Fermat’s theorem, 349 
final 
state, 107-109, 294, 295, 299-301, 
304-308, 312, 330 
weight, 107, 108, 110, 114 
fixed point, 199, 321, 326, 327, 329, 333 
Frobenius 
norm, 283, 345, 380 
product, 345 
Fubini’s theorem, 49, 363 
function 
affine, 66, 246, 355 
concave, see concave function 
continuous, 91, 96, 120 
contracting, 320, 321 
convex, see convex function 
differentiable, 192, 349, 351, 352, 
356 
final weight, 107 
kernel, 120 
Lipschitz, 78, 80, 96, 186, 188, 212, 
240, 254, 255, 271, 274, 276, 
320, 321 
maximum, 352 
measurable, see measurable function 
moment-generating, 288, 364, 365, 
370 
quasi-concave, 176 
semi-continuous, 176 
state-action value, 318, 326, 331, 
332 
supremum, 36 
symmetric, 91 


game, 138 
theory, 121, 137, 139, 142, 147, 176 
value 
zero-sum, 138, 139, 174 
gap penalty, 113 
generalization, 5 
bound, 16, 17, 22, 23, 26, 33, 35, 37, 
38, 40, 48, 54, 55, 59-61, 75, 
77-80, 103, 132-134, 183, 185, 
187, 190, 197, 206, 208, 211, 
213, 237, 239-242, 244, 247, 
251, 254, 255, 259, 262, 264, 
267, 276-278, see also margin 
bound, see also stability bound, 
see also VC-dimension bound 
error, 8, 12, 13, 18, 21, 22, 24-26, 
29, 48, 61, 63, 69, 70, 82, 118, 
131, 136, 144, 148, 172, 174, 
184, 187, 200, 208, 210, 212, 
213, 221, 238, 268, 270, 276 
gradient, 66, 73, 224, 349 
descent, 337, see also stochastic gradient descent 
Gram matrix, 68, 92, 116, see also kernel 
matrix 
graph, 204, 287 
acyclic, 111 
Laplacian, 286, 291 
neighborhood, 287 
structure, 205 
GreedyDecisionTrees algorithm, 195 
growth function, 33, 38-41, 45, 47, 56 
generalization bound, 40 
lower bound, 56 


Hölder’s inequality, 180, 259, 342 
Halving algorithm, 148-150, 152 
Hamming distance, 184, 201, 202, 204, 
375 
Hessian, 66, 68, 180, 349, 351 
Hilbert space, 89, 91, 94-97, 103, 105, 
116, 117, 119, 342, 376 
pre-, 96 
reproducing kernel, 95, 96, 115, 270 
hinge loss, 72, 73, 82, 83, 177, 276 
quadratic, 72, 73, 278 
Hoeffding’s inequality, 21, 39, 61, 158, 
170, 173, 235, 238, 239, 369— 
371, 373, 377, 378 
horizon, 158, 315 
finite, 315, 316 
infinite, 316, 317 
discounted, 316 
undiscounted, 316 
hyperplane, 42, 63 
canonical, 65 
VC-dimension, 76 
equation, 64 
marginal, 65 
maximum-margin, 64 
minimal error, 84 
optimal, 83 
pseudo-dimension, 242 
soft-margin, 84 
tangent, 271 
VC-dimension, 42 
hypothesis, 4 
Bayes, 25 
best-in-class, 26 
consistent, 17 
linear, 63 
set, 4, 12 
finite, 8, 11 
infinite, 8, 33 
single, 22 


i.i.d., 361 
identification in the limit, see language 
identification in the limit 
impurity, 196, 197 
entropy, 196 
Gini index, 196 
mean squared error, 198 
misclassification, 196 
inconsistent, 11 
case, 21, 239 
hypothesis, 21 
independence, see random variable inde- 
pendence 
pairwise on irrelevant alternatives, 
228 
inequality 
Azuma’s, 172, 371-373, 376 
Bennett’s, 371, 377, 378 
Bernstein’s, 371, 377, 378 
Cauchy-Schwarz, 77, 94, 96, 102, 
162, 180, 190, 273, 275, 342, 
343, 367 
Chebyshev’s, 365, 366, 377 
concentration, see concentration in- 
equalities 
Holder’s, 180, 259, 342 
Hoeffding’s, 21, 39, 61, 158, 170, 
173, 235, 238, 239, 369-371, 
373, 377, 378 
Jensen’s, 36, 39, 53, 76, 77, 102, 158, 
190, 353, 374 
Khintchine-Kahane, 103, 156, 374, 
376 
Markov’s, 288, 363, 366, 369 
McDiarmid’s, 33, 35, 36, 117, 269, 
371-373, 376 
Pinsker’s, 279 
Young’s, 343 
inference 
automata, 303, 307 
transductive, 7 
input space, 11 
instances, 3, 11 
sparse, 177 
weighted, 143 
interaction, 1, 313, 314 
Isomap, 285, 286, 290 


Jensen’s inequality, 36, 39, 53, 76, 77, 
102, 158, 190, 353, 374 
Johnson-Lindenstrauss lemma, 288-290 


Karush-Kuhn-Tucker conditions 
see KKT conditions, 356 
kernel, 89, 90 
bigram, 113 
gappy, 113 
continuous, 115 
convolution, 115 
difference, 116 
empirical map, 96-98, 260 
functions, 89, 90 
Gaussian, 94 
matrix, 92 
methods, 89, 90 
n-gram, 120 
negative definite symmetric, 89, 103 
normalized, 97 
polynomial, 92, 117 
positive definite symmetric, 8, 89, 
91, 92 
closure properties, 99 
positive semidefinite, 92 
rational, 8, 83, 89, 106, 111, 113, 
115, 119, 310 
PDS, 112-115 
ridge regression, see KRR 
sequence, 106, 112, see also kernel 
rational 
sigmoid, 94 
string, 115 
tensor product, 99 
KernelPerceptron, see Perceptron algo- 
rithm kernel 


Khintchine-Kahane inequality, 103, 156, 
374, 376 

KKT conditions, 66, 73, 191, 249, 253, 
255, 356, 357 

KPCA, see PCA kernel 

Kullback-Leibler divergence, 279 


labels, 3, 8, 11, 25, 31, 42 
categories, 3 
real-valued, 3 
target, 96 
true, 5 
values, 3 
Lagrange, 357 
function, 354, see also Lagrangian 
multipliers, 85, 86 
variables, 66, 73, 74, 354 
Lagrangian, 66, 67, 73, 74, 191, 248, 253, 
255, 354-357 
language 
k-reversible, 310-312 
accepted, 295, 296, 304, 307 
complement, 110 
context-free, see context-free lan- 
guage 
formal, 339 
identification in the limit, 294, 303, 
308, 310 
learning, 9, 293, 294, 303 
linearly separable, 115 
positive presentation, 308 
regular, 293, 295, 310 
reverse, 304 
reversible, 304, 305, 308-310 
learning, 311 
Laplacian eigenmaps, 285-288, 290, 291 
Lasso, 9, 237, 245, 257-260, 266, 277 
group, 261 
on-line, see OnLineLasso algorithm 
law of large numbers 
strong, 326, 327 
weak, 366 
learner, 7 
active, 296, 313 
base, 123, 127, 130, 136, 139, 143, 
144, 191 
consistent, 5 
passive, 313 
strong, 122 
weak, 121, 129, 130, 136, 141, 143, 
194, 206, 214 
learning, 115, 313 
active, 8 
exact, 294, 295 
on-line, 7 
policy, 334 
problem, 314 
randomized, 153 
reinforcement, 8 
semi-supervised, 7 
supervised, 7 
transductive, 7 
unsupervised, 7 
with queries, 297 
learning bound, see generalization bound 
consistent case, 17 
finite hypothesis set, 17, 23 
inconsistent case, 23 
LearnReversibleAutomata algorithm, 303, 
304, 306-310 
lemma 
contraction, see Talagrand’s lemma 
Hoeffding’s, 369 
Johnson-Lindenstrauss, 288-290 
Massart’s, 39, 40, 54, 56, 258 
Sauer’s, 45-48, 55, 56, 58 
Talagrand’s, 56, 78, 186, 240, 254 
linearly separable, 70, 71, 77, 83, 90, 
93, 115, 118, 140, 162-164, 166, 
167, 224, see also realizable set- 
ting 
Lipschitz 


function, see function Lipschitz 
property, 79, 321 
LLE, 287, 288, 290, 292 
locally linear embedding, see LLE 
logistic regression, 128, 129, 141, 142 
loss 
€-insensitive, 252 
quadratic, 255 
o-admissible, 271 
average, 172 
binary, see loss, zero-one 
bounded, 171 
convex, 128 
convex upper bound, 126, 128 
cumulative, 148 
expected, 139 
exponential, 126 
function, 4, 34, 238 
Hamming, 204 
hinge, see hinge loss 
Huber, 256 
logistic, 128 
margin, 77, 185 
empirical, 78 
matrix, 138 
misclassification, 4 
multi-label, 192 
non-convex, 181 
non-differentiable, 277 
pairwise ranking, 213 
exponential, 218 
ranking 
disagreement, 227 
top k, 232 
squared, 4, 148, 238 
unbounded, 238 
zero-one, 4, 37, 148, 154 
pairwise misranking, 218 


M3N, 205, 207 
manifold learning, 2, 281, 284, 285, 290, 
see also dimensionality reduction 
margin, 63, 64, 75, 162, 185 
L1-, 131, 132 
bound, 8, 80 
geometric, 75 
hard, 71 
loss, 77, 78, 185 
empirical, 78 
maximum-, 64, 65, 136, 137, 140, 
177, 233 
multi-class, 185 
pairwise ranking, 211 
soft, 71, 84, 141, 142 
theory, 8, 64, 75, 83, 121, 137 
margin bound 
binary classification, 80 
covering numbers, 233 
ensemble 
Rademacher complexity, 133 
ranking, 220 
VC-Dimension, 133 
kernel-based hypotheses, 103 
multi-class classification, 187, 190 
ranking, 212, 234 
kernel-based hypotheses, 213 
MarginPerceptron, 177, 178 
Markov decision process, see MDP 
Markov’s inequality, 288, 363, 366, 369 
martingale differences, 371, 373, 376 
Massart’s lemma, 39, 40, 54, 56, 258 
matrix, 344 
Gram, 68 
identity, 66 
kernel, 92 
loss, 138 
multiplication, 108 
norm 
induced, 344 
positive semidefinite, 346 
trace, 103, 344, 346 
transpose, 344 
upper triangular, 346 
maximum likelihood, 129 
Maximum-Margin Markov Networks, see 
M3N 
McDiarmid’s inequality, 33, 35, 36, 117, 
269, 371-373, 376 
MDP, 313, 314 
environment, 315 
finite, 315 
partially observable, 336 
mean, 363, 366, 367, 369, 373, 377 
estimation, 326 
zero-, 360, 364, 378 
measurable, 12, 34, 359 
function, 25, 118, 248, 353 
subset, 237 
Mercer’s 
condition 
see condition Mercer’s, 396 
theorem, 91 
metric space, 320 
complete, 320, 321 
mirror image, 304 
mistake, 149-152, 171, 177 
bound, 8, 149-151, 161, 166, 169, 
171, 176 
cumulative, 153 
model, 148, 171 
rate, 150 
model 
based approach, 326 
continuous-time, 315 
discrete-time, 315 
distribution-free, 13 
free approach, 326 
selection, 5, 6, 27 
moment-generating function, 288, 364, 
365, 370 
mono-label case, 183-185, 207 


multi-label 
case, 183, 184, 192, 207 
error, 207 
loss, 192 


n-way composition, 113, 115 
NDCG, see DCG normalized 
NDS kernel, see kernel negative-definite 
symmetric 
NFA, 295, 309 
consistent, 309 
node impurity, see impurity 
noise, 25, 26, 30, 54, 140-142, 144 
assumption, 26 
average, 25, 26 
learning in presence of, 30 
model, 31 
random, 34, 141, 142, 328 
rate, 30, 31 
source, 198 
non-convex 
loss, 181 
non-differentiable loss, 271, 277 
non-realizable case, 11, 33, 50, 51, 54, 55, 
150 
norm, 341 
equivalent, 341 
Frobenius, 345 
group, 189, 261, 345 
matrix, see matrix norm 
spectral, 344 
vector, see vector norm 


Occam’s razor principle, 24, 29, 48, 63, 
239, 296 

on-line learning, 147 

on-line to batch conversion, 147, 171, 
176, 181 

On-line-SVM algorithm, 177 

one-versus-all, 8, 198-202, 206 

one-versus-one, 8, 199-202, 208 
one-versus-rest, see one-versus-all 
OnLineDualSVR algorithm, 262 
OnLineLasso algorithm, 262, 265, 266 
operator norm, 344 
optimization 

constrained, 354 

dual, 355 

primal, 354 
outlier, 71, 72, 74, 141 
OVA, see one-versus-all 
OVO, see one-versus-one 


PAC-learning, 8, 11, 13, 14, 16, 18-21, 
26, 28-33, 54, 59, 121, 147 
agnostic, 24, 25, 50 
algorithm, 13, 14, 18, 32, 58 
efficiently, 13 
model, 11, 13, 14, 20, 24, 28, 29 
weakly, 121 
with membership queries, 297 
packing numbers, 55 
pairwise consistent, 227 
paradigm 
state-partitioning, 303 
state-splitting, 303 
parse tree, 106 
partially observable Markov decision 
process, see POMDP 
path, 107-111, 114, 115, 161, 175, 294, 
295 
e-, 109, 110, 115 
accepting, 107, 108, 111, 112, 114, 
294, 295, 305 
label, 107 
matching, 109 
redundant, 109 
shortest- problem 
on-line, 175 
successful, see accepting 
PCA, 9, 281 
kernel, 9, 281, 283-288, 290, 292 
PDS kernel, see kernel positive-definite 
symmetric 
Perceptron algorithm, 8, 84, 147, 159- 
163, 166-169, 171, 176-178, 234 
dual, 167, 168 
kernel, 168, 176, 181 
margin, see MarginPerceptron 
ranking, see RankPerceptron 
update, 177 
voted, 163, 168 
Pinsker’s inequality, 279 
pivot, 230 
planning, 9 
algorithm, 319 
problem, 313, 314, 319 
policy, 313-315, 322, 326 
e-greedy, 333 
iteration, 319, 322-324, 337, see also 
PolicyIteration algorithm 
learning, 334 
non-stationary, 316 
stationary, 315 
value, 313, 316 
PolicyIteration algorithm, 323 
Polynomial-Weighted-Average algorithm, 
179 
POMDP, 336 
positive semidefinite, 92, 346 
potential function, 151, 152, 154, 157, 
170, 179, 180 
precision, 232 
average, 232 
preference 
-based 
ranking, 9 
setting, 209, 210, 226, 227, 233 
function, 210, 211, 226-230 
prefix, 114, 294, 301, 304, 308 
principal component analysis, see PCA 
prior knowledge, 4, 96, 98 
probabilistic method, 48, 55, 288 


probability, 359 
conditional, 361 
distribution, 359 
joint mass function, 359 
mass function, 359 
theorem of total, 362 
probably approximately correct, see PAC 
pseudo-dimension, 237, 239, 242-245, 
262 
pseudo-inverse, 98, 246, 287, 346 


Q-learning 
algorithm, 326, 330-332, 334, 335, 
337 
update, 332 
QP, 66, 68, 83, 85, 192, 200, 205, 253,
255, 259, 260
convex, 66, 74
quadratic programming, see QP 
query 
equivalence, 297, 298, 300, 303, 311 
membership, 297-303, 311 
subset, 226, 227 
QueryLearnAutomata algorithm, 298, 
300 
QuickSort algorithm, 230 
randomized, 230, 231, 234 


Rademacher complexity, 8, 33-40, 54, 56,
63, 78, 84, 133, 134, 183, 189,
190, 209, 211, 213, 220, 233,
237, 239, 241, 245, 267, 380
L_p loss functions, 240
binary classification bound, 37
bound, 48, 240, 254, 259
convex combinations, 132, 133
empirical, 34, 37, 38, 55, 77, 102,
103, 186, 380
generalization bounds, 103
kernel-based hypotheses, 102, 247
linear hypotheses, 77
linear hypotheses with bounded L_1
norm, 257, 258
local, 54 
margin bound 
binary classification, 80 
ensembles, 133 
multi-class classification, 187 
ranking, 212 
multi-class kernel-based hypotheses, 
189, 206 
regression bound, 239, 240, 262 
Rademacher variables, 34 
radial basis function, 94 
Radon’s theorem, 43, 44 
random variable, 359 
independence, 39, 76, 289, 327, 361, 
363, 365, 370, 376 
independent, 363, 365, 367 
measurable, 359 
moment-generating function, 364 


Randomized-Weighted-Majority algorithm,
147, 153-155, 175, 179
rank aggregation, 233 
RankBoost, 8, 206-209, 214-220, 222-224,
233-235
ranking, 2, 7, 209, 229 
bipartite, 221, 234 
multipartite, 235 
RankBoost, 214 
with SVMs, 213 
RankPerceptron, 234 
rate 
false positive, 225, 226 
true positive, 225, 226, 232 
rational kernel, 8, 83, 89, 106, 111, 118, 
115, 119, 310 
PDS, 112-115 
Rayleigh quotient, 283, 346 
RBF, see radial basis function 
realizable case, 11, 49, 55, 59, 149-152, 
162, 163 


recall, 232 

regression, 2, 237 
boosting trees, 263 
decision trees, 263 
group norm, 260 
KRR, 245, 247 
Lasso, 245, 257 
linear, 237, 245 
neural networks, 263 
on-line, 261 
ordinal, 234 
ridge, see KRR 
SVR, 245, 252 
unbounded, 238, 262 
regret, 148, 152, 154-157, 159, 172, 173,
175, 179-181, 228, 229
average, 155
bound, 157-159, 174, 175, 179, 180,
209, 229
second-order, 179 
cumulative, 179 
external, 148, 175, 176 
instantaneous, 179, 180 
internal, 175, 176 
lower bound, 155 
minimization, 173-175, 179 
per round, 155 
preference function, 228, 229 
ranking, 228 
swap, 175, 176 
weak, 228 
regular 
expression, 114, 295 
language, 295 
regularization, 28, 142, 246 
L_1-, 141, 142
-based algorithm, 28 
parameter, 28, 181, 197 
path, 259 
term, 28, 248, 250, 257, 271 
regularizer, 28 
relative entropy, 142, 170, 171, 279
representer theorem, 101, 115 
reproducing 
kernel Hilbert space, see Hilbert 
space 
property, 95 
reward, 8, 313-316, 330, 332, 335 
cumulative, 318 
delayed, 314 
deterministic, 315, 317, 331 
expected, 316, 319, 322 
future, 316, 335 
immediate, 8, 314, 316, 326, 335 
long-term, 8 
probability, 315, 319, 325, 326 
vector, 320 
risk, 12, 380, see also error 
empirical, 12, 380 
minimization, see ERM 
empirical minimization, 27 
penalized empirical, 181 
structural 
minimization, see SRM 
RKHS, see Hilbert space 
ROC curve, 209, 224-226, 233, see also 
AUC 
RWM algorithm, see Randomized-
Weighted-Majority algorithm


saddle point, 356, 357 
necessary condition, 356 
sufficient condition, 356 
sample 
complexity, 1, 11, 14, 16-18, 29, 30, 
33, 52, 58 
test, 4 
training, 3 
validation, 3 
sample space, 359 
SARSA algorithm, 334, 335 
Sauer’s lemma, 45-48, 55, 56, 58 


scenario 
deterministic, 25, 184, 210, 237 
randomized, 153 
stochastic, 24, 25, 147, 184, 210, 227, 
237 
score-based setting, 209, 211, 214, 221, 
226, 227, 233 
scores, 4 
scoring function, 185, 189, 199, 202, 203, 
210, 211, 235 
sequence, 90, 106, 110, 111 
kernel, 89, 106, 108, 111, 112 
bigram, 113 
mapping, 111 
protein, 106 
similarity, 106 
stochastic, 155 
sequential minimal optimization algorithm,
see SMO algorithm
setting 
deterministic, 25 
stochastic, 24, 25, 171 
shattering, 41, 241 
coefficient, 55 
witness, 241 
shortest-distance algorithm, 108, 111, 
115 
all-pairs, 286 
singular 
value, 283-288, 344-346 
value decomposition, see SVD 
vector, 282-288, 291, 346, 347 
slack variable, 71, 84, 191, 206, 214, 222, 
248, 252 
SMO algorithm, 68, 83, 85, 86 
sort-by-degree algorithm, 229 
SPSD, see symmetric positive semidefinite

SRM, 27-29 

stability, 233, 251, 256, 267-270, 277, 
278, 372, 373 
bound, 268, 277
KRR, 275, 278 
ranking, 277 
regression, 278 
SVM, 276, 278 
SVR, 274 
coefficient, 267, 268, 270-276 
kernel, 263, 278 
stable, 268, 273 
standard deviation, 6, 86, 365, 367 
standard normal 
distribution, 289, 290, 292, 360, 361, 
364, 374 
form, 374 
random variable, 289 
state, 107, 313, 315 
destination, 294 
final, 107 
initial, 107, 315 
origin, 294 
start, 315 
state-action 
pair, 332 
value, 333, 334 
value function, see function 
stationary point, 349 
stochastic 
approximation, 326 
gradient descent, 161, 177, 261, 263, 
266 
optimization, 327, 337 
stochasticity, 318 
strategy, 139 
grow-then-prune, 197 
mixed, 138, 139 
pure, 138, 139 
string, 107, 108, 112, 113, 119, 294, 295, 
298-300, 303-305, 307-312 
accepted, 295, 296, 304, 305 
access, 299 
counter-example, 300
distinguishing, 299, 301
empty, 106, 294, 295 
finality, 299 
kernel, 106 
leaf, 300 
negative, 296 
partition, 299, 301 
positive, 296, 306 
rejected, 296, 309 
structural risk minimization, see SRM 
structure, 203 
structured 
output, 203, 204 
prediction, 2, 183, 184, 203-205, 207 
subgradient, 272, 273 
subsequence, 119 
subsequences, 106 
substring, 106 
sum rule, 362 
supermartingale convergence, 328, 329 
support vector, 67, 74, 162 
machine, see SVM 
networks, 83 
regression, see SVR 
SVD, 98, 99, 345 
SVM, 8, 63-75, 82-87, 89-91, 94, 100- 
102, 106, 115, 118, 119, 131, 
137, 142, 143, 162-164, 166-168,
176, 177, 191, 192, 200,
201, 205, 209, 213, 214, 222, 
233, 252, 253, 255, 256, 267, 
271, 276, 278 
multi-class, 8, 183, 
206 
ranking with, 8, 213, 214, 233, 234 
regression, see SVR 
SVMStruct, 205 
SVR, 237, 245, 252, 255-257, 260, 261, 
263, 267, 271, 274, 275 
dual, 262, 264 
on-line, see OnLineDualSVR algorithm
Huber loss, 264 
on-line, 263 
quadratic, 255, 256, 264 
on-line, 266 
stability, 274 


target 
concept, 12 
values, 11 


TD(λ) algorithm, 335, 336
TD(0) algorithm, 330, 331, 335 
theorem
central limit, 367
Fermat’s, 349
Fubini’s, 49, 363
Mercer’s, 91
Radon’s, 43, 44
representer, 101
von Neumann’s minimax, 139, 174
transducer
acyclic, 108
composition, 108, 109, 115, 380
counting, 113, 114
inverse, 112
weighted, 106-109, 111-113
transition, 107-112, 114, 294, 295, 299-301,
304, 306-308, 310, 315-317, 322, 326
label, 107
probability, 315, 317-320, 325, 326
trigrams, 90 
true positive rate, 225, 226, 232 


uniform convergence bound, 17, 23 
uniform stability, see stability 
uniformly β-stable, see stable
union bound, 15, 362 
update rule, 85, 169, 334
additive, 169
multiplicative, 169, 176


value iteration, 319, 324, see also
ValueIteration algorithm
ValueIteration algorithm, 320
variance, 6, 54, 70, 166, 282-284, 289, 
290, 365, 366, 371, 377, 378 
unit, 287, 288, 360 
VC-dimension, 8, 33, 41 
ensemble margin bound, 133 
generalization bound, 48 
lower bounds, 48, 49, 51 
vector, 341 
norm, 341, 344, 345 
singular 
left, 345, 346 
right, 345, 346 
space, 341, 342 
normed, 374 
von Neumann’s minimax theorem, 139, 174


weight function, 231, 235 
Weighted-Majority algorithm, 147, 150- 
152, 154, 156, 169, 175, see also 
Randomized-Weighted-Majority 
algorithm 
Widrow-Hoff algorithm, 261 
on-line, 263 
Winnow 
algorithm, 8, 147, 159, 168-171, 176 
update, 169 
WM algorithm, see Weighted-Majority 
algorithm 


Young’s inequality, 343 